Abstract



A Hilbert space embedding for probability measures has recently been proposed, wherein
any probability measure is represented as a mean element in a reproducing kernel Hilbert
space (RKHS). Such an embedding has found applications in homogeneity testing,
independence testing, dimensionality reduction, etc., with the requirement that the
reproducing kernel is characteristic, i.e., the embedding is injective.



In this paper, we generalize this embedding to finite signed Borel measures, wherein any
finite signed Borel measure is represented as a mean element in an RKHS. We show that
the proposed embedding is injective if and only if the kernel is universal. This
provides a novel characterization of universal kernels, which were proposed in the context
of achieving the Bayes risk by kernel-based classification/regression algorithms. By
exploiting this relation between universality and the embedding of finite signed Borel
measures into an RKHS, we establish the relation between universal and characteristic kernels.
Keywords: Kernel methods, Characteristic kernels, Hilbert space embeddings, Universal
kernels, Translation invariant kernels, Radial kernels, Probability metrics, Binary
classification, Homogeneity testing.

1. Introduction 

Kernel methods have been popular in machine learning and pattern analysis for their
superior performance on a wide spectrum of learning tasks. They are broadly established as
an easy way to construct nonlinear algorithms from linear ones, by embedding data points
into higher dimensional reproducing kernel Hilbert spaces (RKHSs) (Schölkopf and Smola,
2002; Shawe-Taylor and Cristianini, 2004). Recently, this idea has been generalized to
embed probability distributions into RKHSs, which provides a linear method for dealing
with higher order statistics (Gretton et al., 2007; Smola et al., 2007; Fukumizu et al.,
2008, 2009b; Sriperumbudur et al., 2008, 2009a,b). Formally, given the set of all Borel
probability measures defined on the topological space $X$, and the RKHS $(\mathcal{H}, k)$
of functions on $X$ with $k : X \times X \to \mathbb{R}$ as its reproducing kernel (r.k.)
that is measurable and bounded, any Borel probability measure $\mathbb{P}$ is embedded as

$$\mathbb{P} \mapsto \int_X k(\cdot, x)\, d\mathbb{P}(x). \qquad (1)$$



Such an embedding has been found to be useful in many statistical applications like
homogeneity testing (Gretton et al., 2007), independence testing (Gretton et al., 2008;
Fukumizu et al., 2008), dimensionality reduction (Fukumizu et al., 2004, 2009a), etc., as
it provides a powerful and straightforward method of dealing with higher-order statistics
of random variables. However, in these applications, it is critical that the embedding in
(1) is injective so that probability measures can be distinguished by their images in
$\mathcal{H}$. To this end, Fukumizu et al. (2008) introduced the notion of a
characteristic kernel (a bounded, measurable $k$ is said to be characteristic if (1) is
injective), for which many characterizations have recently been provided (Gretton et al.,
2007; Fukumizu et al., 2008, 2009b; Sriperumbudur et al., 2008, 2009a,b).



A natural extension to the above idea of embedding probability measures into an RKHS,
$\mathcal{H}$, is to embed finite signed Borel measures, $\mu$, into $\mathcal{H}$ as

$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \qquad (2)$$

and study the conditions on the kernel $k$ for which such an embedding is injective.
Although the embedding in (2) can be proposed and investigated for mathematical pleasure,
we show as one of the main contributions of this paper that under certain conditions on
$\mu$ and $X$, the embedding in (2) is closely related to the concept of universal kernels
(see Section 1.1 for the formal introduction to universal kernels), which was first
proposed by Steinwart (2001), in the context of achieving the Bayes risk in kernel-based
classification/regression algorithms, and later extended by Micchelli et al. (2006),
Carmeli et al. (2009) and Sriperumbudur et al. (2010). This connection shows that the
embedding in (2) is not just an abstract mathematical object, but has applications in
kernel-based classification/regression algorithms. Using the connection between (2) and
universal kernels, we then show how the various notions of universality mentioned above
are related to each other. In addition, since the embedding in (2) is a generalization of
the embedding in (1), we also demonstrate the relation between characteristic kernels and
universal kernels, which extends the preliminary study carried out in Sriperumbudur et al.
(2009b, Section 3.4).

In the remainder of this introduction, we provide an overview of our contributions, which
are presented in detail in later sections. First, in Section 1.1, we introduce
universality, briefly discuss various notions of universality that are proposed in the
literature, and outline our contribution: a measure embedding viewpoint of universality,
which is novel and different from the existing viewpoint of approximating functions in some
target space by functions in an RKHS. We show that a kernel is universal if and only if the
embedding in (2) is injective. Second, in Section 1.2, we discuss our second contribution
of relating universal and characteristic kernels.



The present paper is an extended version of Sriperumbudur et al. (2010).

1.1 Contribution 1: Injective RKHS embedding of finite signed Radon measures to characterize universality

In the regularization approach to learning (Evgeniou et al., 2000), it is well known that
kernel-based algorithms (for classification/regression) generally invoke the representer
theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001) and learn a function in
$\mathcal{H}$ that has the representation

$$f = \sum_{j \in \mathbb{N}_n} c_j k(\cdot, x_j), \qquad (3)$$

where $\mathbb{N}_n := \{1, 2, \ldots, n\}$ and $\{c_j : j \in \mathbb{N}_n\} \subset \mathbb{R}$
are parameters typically obtained from training data, $\{x_j : j \in \mathbb{N}_n\} \subset X$.
As noted in Micchelli et al. (2006), one can ask whether the function $f$ in (3)
approximates any real-valued target function arbitrarily well as the number of summands
increases without bound. This is an important question to consider because if the answer
is affirmative, then the kernel-based learning algorithm is consistent in the sense that
for any target function $f^*$ (which is usually assumed to belong to some subset of the
space of real-valued continuous functions defined on $X$), the discrepancy between $f$
(which is learned from the training data) and $f^*$ goes to zero (in some sense) as the
sample size goes to infinity. Since

$$\Big\{ \sum_{j \in \mathbb{N}_n} c_j k(\cdot, x_j) : n \in \mathbb{N},\ \{c_j\} \subset \mathbb{R},\ \{x_j\} \subset X \Big\}$$

is dense in $\mathcal{H}$ (Aronszajn, 1950), and assuming that the kernel-based algorithm
makes $f$ "converge to an appropriate function" in $\mathcal{H}$ as $n \to \infty$, the
above question of approximating $f^*$ arbitrarily well by $f$ in (3) as $n$ goes to
infinity is equivalent to the question of whether $\mathcal{H}$ is rich enough to
approximate any $f^*$ arbitrarily well, i.e., whether $\mathcal{H}$ is universal. We show
that characterizing universal RKHSs (or equivalently, characterizing the corresponding
reproducing kernels, as any RKHS is uniquely determined by its reproducing kernel)
leads to the embedding in (2).
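To make the representation in (3) concrete, the following is a minimal sketch, not the authors' procedure. It assumes a Gaussian kernel on the real line with an illustrative bandwidth, and it obtains the coefficients $c_j$ by plain interpolation of the target on the training points rather than by a regularized learner. All function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=0.5):
    """Gaussian kernel on the real line; sigma is an illustrative bandwidth."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def solve_linear_system(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is nonsingular (true for a Gaussian Gram matrix on
    distinct points, since the Gaussian kernel is strictly pd).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (m[i][n] - s) / m[i][i]
    return c


def fit_coefficients(xs, ys, kernel):
    """Choose c so that f(x_j) = y_j on the training points (interpolation)."""
    gram = [[kernel(xi, xj) for xj in xs] for xi in xs]
    return solve_linear_system(gram, ys)


def expansion(xs, cs, kernel):
    """Return f = sum_j c_j k(., x_j), the form in (3)."""
    return lambda t: sum(c * kernel(t, x) for c, x in zip(cs, xs))


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]   # training inputs x_j
ys = [abs(x) for x in xs]          # values of an illustrative target f*(x) = |x|
cs = fit_coefficients(xs, ys, gaussian_kernel)
f = expansion(xs, cs, gaussian_kernel)
```

Here `f` reproduces the target on the five training points; whether such expansions can approximate the target everywhere as $n$ grows is exactly the universality question raised above.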

As mentioned above, the goal is to characterize those $\mathcal{H}$ that can approximate
any $f^*$ in some target space, usually assumed to be some subset of the space of
real-valued continuous functions on $X$. Therefore, depending on the choice of $X$, the
choice of target space and the type of approximation, various notions of universality have
been proposed (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2009;
Sriperumbudur et al., 2010), which are briefly discussed in the following paragraphs. The
eventual goal is to have a notion of universality that allows comprehensive (and general)
necessary and/or sufficient conditions on the reproducing kernel for approximating, as
strongly as possible, a class of target functions that is as general as possible.

c-universality: Let $C(X)$ denote the space of continuous real-valued functions on some
topological space, $X$. Steinwart (2001) considered the above approximation problem when
$X$ is a compact metric space, with $f^* \in C(X)$, and defined a continuous kernel $k$ as
universal (in this paper, we refer to it as c-universal) if its associated RKHS,
$\mathcal{H}$, is dense in $C(X)$ w.r.t. the uniform norm (see Section 2 for the
definition of the uniform norm), i.e., for any $f^* \in C(X)$, there exists a
$g \in \mathcal{H}$ that uniformly approximates $f^*$. In the context of learning, this
indicates that if a kernel is c-universal, then the corresponding kernel-based learning
algorithm could be consistent in the sense that any target function $f^* \in C(X)$ could
be approximated arbitrarily well in the uniform norm by $f$ in (3) as $n$ goes to infinity
(see Steinwart and Christmann (2008, Corollary 5.29) for a rigorous result). By applying
the Stone-Weierstraß theorem (Folland, 1999, Theorem 4.45), Steinwart (2001) then provided
sufficient conditions for a kernel to be c-universal, using which the Gaussian kernel is
shown to be c-universal on every compact subset of $\mathbb{R}^d$.

As our contribution, in Section 3.1, we completely characterize c-universal kernels by
showing that $k$ is c-universal if and only if the embedding in (2) is injective for
$\mu \in M_b(X)$, the space of finite signed Radon measures defined on a compact Hausdorff
space, $X$ (see Section 2 for a formal definition of $M_b(X)$). Note that this result is
different from, and more general than, the one by Steinwart (2001, Theorem 9): both
necessary and sufficient conditions are provided here, whereas only a sufficient condition
is provided there. Using this characterization, as a special case, we also obtain
necessary and sufficient conditions for a Fourier kernel (see Section 3.3) to be
c-universal, while Steinwart (2001) provided only a sufficient condition.



cc-universality: One limitation in the setup considered by Steinwart (2001) is that $X$ is
assumed to be compact, which excludes many interesting spaces, such as $\mathbb{R}^d$ and
infinite discrete sets. To overcome this limitation, Carmeli et al. (2009, Definition 2,
Theorem 3) and Sriperumbudur et al. (2010) approximated any $f^* \in C(X)$ by some
$g \in \mathcal{H}$ uniformly over every compact set, $Z \subset X$, by defining a
continuous kernel $k$ to be universal (in this paper, we refer to it as cc-universal) if
the corresponding RKHS, $\mathcal{H}$, is dense in $C(X)$ with the topology of compact
convergence, where $X$ is a non-compact Hausdorff space. That is, for any compact set
$Z \subset X$ and any $f^* \in C(Z)$, there exists a $g \in \mathcal{H}|_Z$ that uniformly
approximates $f^*$. Here, $C(Z)$ is the space of all continuous real-valued functions on
$Z$ equipped with the uniform norm, $\mathcal{H}|_Z := \{f|_Z : f \in \mathcal{H}\}$ is
the restriction of $\mathcal{H}$ to $Z$ and $f|_Z$ is the restriction of $f$ to $Z$.

As our contribution, in Section 3.1, we show that $k$ is cc-universal if and only if the
embedding in (2) is injective for $\mu \in M_{bc}(X)$, the space of compactly supported
finite signed Radon measures defined on a non-compact Hausdorff space, $X$. Compared to
the characterization by Carmeli et al. (2009, Theorem 4), which deals with the injectivity
of a certain integral operator on the space of square-integrable functions, our
characterization is easy to understand, as it is related to a generalization of the
embedding in (1), and will naturally lead to understanding the relation between
cc-universal and characteristic kernels. Using this characterization, we also show that
$k$ is cc-universal if and only if it is universal in the sense of Micchelli et al.
(2006): for any compact $Z \subset X$, the set
$K(Z) := \mathrm{span}\{k(\cdot, y) : y \in Z\}$ is dense in $C(Z)$ in the uniform norm
(see Remark 0(b); also see Carmeli et al. (2009, Remark 1)). As examples, many popular
kernels on $\mathbb{R}^d$ are shown to be cc-universal (see Sections 3.2 and 3.4; also see
Micchelli et al. (2006, Section 4)): Gaussian, Laplacian, $B_{2l+1}$-spline, sinc kernel,
etc.

$c_0$-universality: Although cc-universality solves the limitation of c-universality by
handling non-compact $X$, the topology of compact convergence considered in
cc-universality is weaker than the topology of uniform convergence, i.e., a sequence of
functions $\{f_n\} \subset C(X)$ converging to $f \in C(X)$ in the topology of uniform
convergence also converges in the topology of compact convergence, but not vice versa. So,
the natural question to ask is whether we can characterize $\mathcal{H}$ that are rich
enough to approximate any $f^*$ on non-compact $X$ in a stronger sense, i.e., uniformly,
by some $g \in \mathcal{H}$. Recently, this has been answered by Carmeli et al. (2009,
Definition 2, Theorem 1) and Sriperumbudur et al. (2010), wherein they defined $k$ to be
$c_0$-universal if $k$ is bounded, $k(\cdot, x) \in C_0(X)$ for all $x \in X$, and its
corresponding RKHS, $\mathcal{H}$, is dense in $C_0(X)$ w.r.t. the uniform norm, where $X$
is a locally compact Hausdorff (LCH) space and $C_0(X)$ is the Banach space of bounded
continuous functions vanishing at infinity, endowed with the uniform norm (see Section 2
for the definition of $C_0(X)$).

As our contribution, in Section 3.1, we present the following necessary and sufficient
condition for a kernel to be $c_0$-universal: $k$ is $c_0$-universal if and only if the
embedding in (2) is injective for $\mu \in M_b(X)$. This characterization naturally leads
to an understanding of the relation between $c_0$-universal and characteristic kernels,
which is not straightforward with the characterization obtained by Carmeli et al. (2009,
Theorem 2), wherein $c_0$-universality is characterized by the injectivity of a certain
integral operator on the space of square-integrable functions. Using this result, simple
necessary and sufficient conditions are derived for translation invariant kernels on
$\mathbb{R}^d$ (see Section 3.2), Fourier kernels on $\mathbb{T}^d$, the $d$-Torus (see
Section 3.3), and radial kernels on $\mathbb{R}^d$ (see Section 3.4) to be
$c_0$-universal. Examples of $c_0$-universal kernels on $\mathbb{R}^d$ include the
Gaussian, Laplacian, $B_{2l+1}$-spline, inverse multiquadratics, Matérn class, etc.

$c_b$-universality: The definition of $c_0$-universality deals with $\mathcal{H}$ being
dense in $C_0(X)$ w.r.t. the uniform norm, where $X$ is an LCH space. Although the notion
of $c_0$-universality addresses limitations associated with both c- and cc-universality,
it only approximates a subset of $C(X)$, i.e., it cannot deal with functions in
$C(X) \setminus C_0(X)$. This limitation can be addressed by considering a larger class of
functions to be approximated.

To this end, we propose a notion of universality that is stronger than $c_0$-universality:
$k$ is said to be $c_b$-universal if its corresponding RKHS, $\mathcal{H}$, is dense in
$C_b(X)$, the space of bounded continuous functions on a topological space, $X$ (note that
$C_0(X) \subset C_b(X)$). This notion of $c_b$-universality is more applicable in learning
theory than $c_0$-universality, as the target function $f^*$ can belong to $C_b(X)$ (which
is a more natural assumption) instead of being restricted to $C_0(X)$ (note that $C_0(X)$
only contains functions that vanish at infinity). We show in Section 3.1 that $k$ is
$c_b$-universal if and only if the embedding in (2) is injective for $\mu$ belonging to a
certain class of set functions (see Section 2 for the definition of set functions) defined
on a normal topological space, $X$ (see Theorem 6 for details). Because of the
technicalities involved in dealing with set functions, in this paper, we do not analyze
this notion of universality as fully as the other aforementioned notions, although it is
an interesting problem to be resolved because of its applicability in learning theory.

Based on the above discussion, which relates injectivity of the embedding in (2) to
various notions of universality, we also show how these notions of universality are
related. If $X$ is compact, the notions of c-, cc-, $c_0$- and $c_b$-universality are
equivalent. On the other hand, if $X$ is not compact, the notion of $c_0$-universality is
stronger than cc-universality. That is, if a kernel is $c_0$-universal, then it is
cc-universal, but not vice versa (for example, the Gaussian kernel on $\mathbb{R}^d$ is
shown to be $c_0$-universal and therefore is cc-universal, while the sinc kernel is
cc-universal but not $c_0$-universal). We show in Section 3.4 that the converse is true in
the case of radial kernels on $\mathbb{R}^d$. Similarly, when $X$ is not compact (but an
LCH space), the notion of $c_b$-universality is stronger than $c_0$-universality, and
therefore cc-universality. A summary of the relationship between various notions of
universality is shown in Figure 1.

To summarize our first contribution, we show that, by appropriately choosing $X$ and $\mu$
in (2), the injectivity of the embedding in (2) completely characterizes various notions
of universality that are proposed in the literature. Using this connection between
universality and the injectivity of the embedding in (2), we relate all these notions of
universality, which is summarized in Figure 1.



1.2 Contribution 2: Relation between characteristic and universal kernels 

Gretton et al. (2007) related universality and the characteristic property of $k$ by
showing that if $k$ is c-universal, then it is characteristic. Besides this result, not
much is known or understood about the relation between universal and characteristic
kernels. In Section 4.1, we relate universality and characteristic kernels by using the
results in Section 3.1 that relate universality and the RKHS embedding of Radon measures.
As an example, we show that a translation invariant kernel on $\mathbb{R}^d$ (in general,
any locally compact Abelian group) or a radial kernel on $\mathbb{R}^d$ is
$c_0$-universal if and only if it is characteristic. We also show that the converse to the
result by Gretton et al. (2007) is not true, i.e., if a kernel is characteristic, it need
not be c-universal (see Sriperumbudur et al., 2009b, Corollary 15). A summary of the
relation between universal and characteristic kernels is shown in Figure 1.

Using the embedding in (1), Gretton et al. (2007) proposed a metric, called the maximum
mean discrepancy (MMD), on the space of all Borel probability measures, when $k$ is
characteristic. One important theoretical question that is usually considered for metrics
on probability measures is (Dudley, 2002, Chapter 11): "What is the nature of the
topology induced by the probability metric in relation to the usual weak topology?" In
probability theory, this question is important in understanding and proving central limit
theorems. Although $k$ being characteristic is sufficient for MMD to be a metric, we show
in Section 4.2 that a notion stronger than the characteristic property is required to
answer the above question. In particular, we show in Proposition 23 that if $X$ is an LCH
space and $k$ is $c_0$-universal, then the topology induced by MMD coincides with the
usual weak topology on the space of Radon probability measures defined on $X$ (see
footnote 2). This result can be used to compare MMD to other probability metrics, such as
the Dudley metric, total variation distance, Wasserstein distance, etc. We refer to
Sriperumbudur et al. for a detailed study on the comparison of MMD to other probability
metrics.
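As an illustration of how the MMD is computed from the embedding in (1), here is a minimal sketch for two empirical measures on the real line. It uses the standard identity that the squared RKHS distance between two mean elements expands into three averaged kernel sums. The Gaussian kernel, its bandwidth, the sample values and the function names are assumptions made for the example, not taken from the paper.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the real line; sigma is an illustrative bandwidth."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, kernel):
    """Average of k(x, y) over all pairs: the inner product of the two
    empirical mean elements in the RKHS."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Squared RKHS distance between the embeddings of two empirical measures
    (the plug-in estimate, which includes the diagonal terms)."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel))


sample_p = [0.0, 0.1, -0.2, 0.3]   # hypothetical draws from P
sample_q = [1.0, 1.2, 0.9, 1.1]    # hypothetical draws from Q
distance_pq = mmd_squared(sample_p, sample_q)
distance_pp = mmd_squared(sample_p, sample_p)
```

For a characteristic kernel the population quantity vanishes only when the two measures coincide; the sketch only exhibits the finite-sample analogue.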
To summarize, our main contributions in this paper are: 

(a) To establish the relationship between various notions of universality and the RKHS
embedding, shown in (2), of finite signed Radon measures, and in turn present a
novel measure embedding viewpoint of universality compared to the classical function
approximation viewpoint.

(b) To clarify the relationship between universal and characteristic kernels. 



2. Sriperumbudur et al. (2009b) showed that if $X$ is a compact metric space and $k$ is
c-universal, then the topology induced by MMD coincides with the usual weak topology. The
result for non-compact $X$ was left as an open question and is addressed in this paper, by
applying the notion of $c_0$-universality.

A summary of the results in this paper is shown in Figure 1. In the following section, we
introduce the notation and some definitions that are used throughout the paper.
Supplementary results used in proofs are collected in Appendix A.

2. Definitions & Notation 

Let $X$ be a topological space. $C(X)$ denotes the space of all continuous functions on
$X$. $C_b(X)$ is the space of all bounded, continuous functions on $X$. For a locally
compact Hausdorff space $X$, $f \in C(X)$ is said to vanish at infinity if for every
$\epsilon > 0$ the set $\{x : |f(x)| \ge \epsilon\}$ is compact. The class of all
continuous $f$ on $X$ which vanish at infinity is denoted as $C_0(X)$. The spaces $C_b(X)$
and $C_0(X)$ are endowed with the uniform norm, $\|\cdot\|_u$, defined as
$\|f\|_u := \sup_{x \in X} |f(x)|$ for $f \in C_0(X) \subset C_b(X)$.

If Y denotes a topological vector space, we denote by Y' the vector space of continuous 
linear functionals on Y, and Y' is called the topological dual space (in this paper, we simply 
refer to it as the dual). 

For a set A, we denote its interior as A°. 

Radon measure: A signed Radon measure $\mu$ on a Hausdorff space $X$ is a Borel measure
on $X$ satisfying

(i) $\mu(C) < \infty$ for each compact subset $C \subset X$,

(ii) $\mu(B) = \sup\{\mu(C) \mid C \subset B,\ C \text{ compact}\}$ for each $B$ in the Borel $\sigma$-algebra of $X$.

$\mu$ is said to be finite if $\|\mu\| := |\mu|(X) < \infty$, where $|\mu|$ is the
total variation of $\mu$. $M_+^b(X)$ denotes the space of all finite Radon measures on $X$
while $M_b(X)$ denotes the space of all finite signed Radon measures on $X$. The space of
all Radon probability measures is denoted as
$M_+^1(X) := \{\mu \in M_+^b(X) : \mu(X) = 1\}$. For $\mu \in M_b(X)$, the support of
$\mu$ is defined as

$$\mathrm{supp}(\mu) = \{x \in X \mid \text{for any open set } U \text{ such that } x \in U,\ |\mu|(U) \neq 0\}. \qquad (4)$$

$M_{bc}(X)$ denotes the space of all compactly supported finite signed Radon measures on
$X$. We refer the reader to Berg et al. (1984, Chapter 2) for a general reference on the
theory of Radon measures.

Finitely additive, regular set function: A set function is a function defined on a family
of sets, and has values in $[-\infty, +\infty]$.

A set function $\mu$ defined on a family $\tau$ of sets is said to be finitely additive if
$\emptyset \in \tau$, $\mu(\emptyset) = 0$ and
$\mu(\cup_{l=1}^n A_l) = \sum_{l=1}^n \mu(A_l)$ for every finite family
$\{A_1, \ldots, A_n\}$ of disjoint subsets of $\tau$ such that $\cup_{l=1}^n A_l \in \tau$.

A field of subsets of a set $X$ is a non-empty family, $\Sigma$, of subsets of $X$ such
that $\emptyset \in \Sigma$, $X \in \Sigma$, and for all $A, B \in \Sigma$, we have
$A \cup B \in \Sigma$ and $B \setminus A \in \Sigma$.

An additive set function $\mu$ defined on a field $\Sigma$ of subsets of a topological
space $X$ is said to be regular if for each $A \in \Sigma$ and $\epsilon > 0$, there
exists $B \in \Sigma$ whose closure is contained in $A$ and there exists $C \in \Sigma$
whose interior contains $A$ such that $|\mu(D)| < \epsilon$ for every $D \in \Sigma$ with
$D \subset C \setminus B$.

Positive definite (pd), strictly pd and conditionally strictly pd: A function
$k : X \times X \to \mathbb{R}$ is called positive definite (pd) (resp. conditionally pd) if, for all $n \in \mathbb{N}$ (resp.
[Figure 1, panels (a)-(d): implication diagrams relating c-, cc- and $c_0$-universal,
characteristic and strictly pd kernels, with arrows labeled by Proposition (P.), Remark
(R.) or Theorem (T.) numbers. Conditions appearing in the panels: (a), (b)
$\iint_X k(x,y)\, d\mu(x)\, d\mu(y) > 0$ holding for all $\mu \in M_b(X)$ or all
$\mu \in M_{bc}(X)$; (c) $\mathrm{supp}(\Lambda) = \mathbb{R}^d$,
$(\mathrm{supp}(\Lambda))^\circ \neq \emptyset$ and
$\psi \in C_b(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$; (d) $\mathrm{supp}(\nu) \neq \{0\}$.]

Figure 1: Summary of results: The relationships between various notions are shown along
with the reference. The letters "P", "R" and "T" refer to Proposition, Remark
and Theorem respectively. For example, P. 7 refers to Proposition 7. The
implications which are open problems are shown with "?". The trivial implications
are shown without any reference. (a) $X$ is an LCH space. Refer to Section 2
for the definition of $M_b(X)$ and $M_{bc}(X)$. (b) The implications shown hold for
any compact Hausdorff space, $X$. However, when $X = \mathbb{T}^d$, the $d$-Torus, with
$k(x,y) = \psi((x - y) \bmod 2\pi)$ where $\psi \in C(\mathbb{T}^d)$ is a positive
definite (pd) function, the marked implication between characteristic and strictly pd is
valid, which follows from Proposition 14 and Theorem 15. (c) $X = \mathbb{R}^d$ and
$k(x,y) = \psi(x - y)$, where $\psi \in C_b(\mathbb{R}^d)$ is a pd function and the
Fourier transform of a finite non-negative Borel measure, $\Lambda$ (see Theorem 10 for
details). If $\psi \in C_b(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$, then the marked
implication holds. Otherwise, it is not clear whether the implication holds. For a set
$A$, $A^\circ$ represents its interior. (d) $X = \mathbb{R}^d$ and
$k(x,y) = \varphi(\|x - y\|_2^2)$, where $\varphi$ is the Laplace transform of a
finite non-negative Borel measure, $\nu$ on $[0, \infty)$ (see (|2ip ).

$n \ge 2$), $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ (resp. with $\sum_{j=1}^n \alpha_j = 0$) and all $x_1, \ldots, x_n \in X$, we have

$$\sum_{l,j=1}^n \alpha_l \alpha_j k(x_l, x_j) \ge 0. \qquad (5)$$

Furthermore, $k$ is said to be strictly pd (resp. conditionally strictly pd) if, for mutually
distinct $x_1, \ldots, x_n \in X$, equality in (5) only holds for $\alpha_1 = \cdots = \alpha_n = 0$.
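The difference between pd and strictly pd can be seen on two points. The sketch below evaluates the quadratic form in (5) for a Gaussian kernel and for the linear kernel $k(x,y) = xy$ on the real line; both kernels, the points and the coefficients are illustrative choices, and the function names are hypothetical.

```python
import math


def quadratic_form(points, coeffs, kernel):
    """Evaluate sum_{l,j} a_l a_j k(x_l, x_j), the left-hand side of (5)."""
    return sum(a * b * kernel(x, y)
               for a, x in zip(coeffs, points)
               for b, y in zip(coeffs, points))


def gaussian_kernel(x, y):
    """Gaussian kernel with unit scale (illustrative)."""
    return math.exp(-((x - y) ** 2))


def linear_kernel(x, y):
    """Linear kernel on the real line: pd, but not strictly pd."""
    return x * y


points = [1.0, 2.0]     # mutually distinct points
coeffs = [2.0, -1.0]    # nonzero coefficients with 2*1 - 1*2 = 0

# For the linear kernel the form equals (sum_j a_j x_j)^2, which vanishes here
# although the coefficients are nonzero, so strict positive definiteness fails.
q_linear = quadratic_form(points, coeffs, linear_kernel)

# For the Gaussian kernel the same coefficients give a strictly positive value.
q_gauss = quadratic_form(points, coeffs, gaussian_kernel)
```

A single vanishing configuration suffices to rule out strict positive definiteness, whereas establishing it requires an argument covering all finite sets of distinct points.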

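Fourier transform in $\mathbb{R}^d$: For $X \subset \mathbb{R}^d$, let $L^p(X)$ denote the
Banach space of $p$-power ($p \ge 1$) integrable functions w.r.t. the Lebesgue measure.
For $f \in L^1(\mathbb{R}^d)$, $\hat{f}$ and $\check{f}$ represent the Fourier transform
and inverse Fourier transform of $f$ respectively, defined as

$$\hat{f}(y) := (2\pi)^{-d/2} \int_{\mathbb{R}^d} e^{-i y^T x} f(x)\, dx, \quad y \in \mathbb{R}^d, \qquad (6)$$

$$\check{f}(x) := (2\pi)^{-d/2} \int_{\mathbb{R}^d} e^{i x^T y} f(y)\, dy, \quad x \in \mathbb{R}^d, \qquad (7)$$

where $i$ denotes the imaginary unit $\sqrt{-1}$. For a finite Borel measure $\mu$ on
$\mathbb{R}^d$, the Fourier transform of $\mu$ is given by

$$\hat{\mu}(\omega) = \int_{\mathbb{R}^d} e^{-i \omega^T x}\, d\mu(x), \quad \omega \in \mathbb{R}^d, \qquad (8)$$

which is a bounded, uniformly continuous function on $\mathbb{R}^d$.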
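For a finitely supported measure, the integral in (8) reduces to a finite sum, which makes the boundedness claim easy to check numerically. The sketch below takes $d = 1$ and a measure with real weights on three atoms; the atoms, weights and function name are hypothetical.

```python
import cmath


def measure_fourier_transform(atoms, weights, omega):
    """Fourier transform (8) of mu = sum_j a_j delta_{x_j} on the real line:
    hat{mu}(omega) = sum_j a_j exp(-i * omega * x_j)."""
    return sum(a * cmath.exp(-1j * omega * x) for x, a in zip(atoms, weights))


atoms = [-1.0, 0.0, 2.0]       # support points x_j (illustrative)
weights = [0.5, -1.5, 1.0]     # signed masses a_j (illustrative)

total_mass = sum(weights)                        # mu(R)
total_variation = sum(abs(a) for a in weights)   # |mu|(R)

# At omega = 0 the transform returns the total mass; at every frequency its
# modulus is bounded by the total variation of the measure.
value_at_zero = measure_fourier_transform(atoms, weights, 0.0)
sampled = [measure_fourier_transform(atoms, weights, 0.25 * j) for j in range(-40, 41)]
```

The bound by the total variation follows from the triangle inequality, since each exponential has modulus one.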

Holomorphic and entire functions: Let $D \subset \mathbb{C}^d$ be an open subset and
$f : D \to \mathbb{C}$ be a function. $f$ is said to be holomorphic at the point
$z_0 \in D$ if

$$f'(z_0) := \lim_{z \to z_0} \frac{f(z_0) - f(z)}{z_0 - z}$$

exists. Moreover, $f$ is called holomorphic if it is holomorphic at every $z_0 \in D$.
$f$ is called an entire function if $f$ is holomorphic and $D = \mathbb{C}^d$.



3. Characterization of Universal Kernels 

In Section 1, we have briefly discussed the relation between the embedding in (2) and
various notions of universality. In Section 3.1, we present and prove our main result
(Theorem 6), which relates universality and the embedding in (2). Theorem 6 shows that
under appropriate assumptions on $\mu$ and $X$, the injectivity of the embedding in (2) is
necessary and sufficient for a kernel to be c-, cc-, $c_0$- or $c_b$-universal. Using this
result, it is shown that the notion of $c_0$-universality is stronger than that of
cc-universality, i.e., if $k$ is $c_0$-universal, then it is cc-universal, but not vice
versa. Then, in Proposition 8, we obtain alternate necessary and sufficient conditions for
the embedding in (2) to be injective, which resemble a condition for the kernel to be
strictly pd (but not quite so!). However, in Proposition 8, we show that strict positive
definiteness of $k$ is a necessary condition for the embedding in (2) to be injective,
i.e., for $k$ to be universal. Using the characterization obtained in Proposition 8, in
Sections 3.2-3.5 we derive characterizations for universality that are easy to check, for
specific classes of kernels, e.g., translation invariant kernels on $\mathbb{R}^d$ and
$\mathbb{T}^d$, radial kernels on $\mathbb{R}^d$, Taylor-type kernels on $\mathbb{R}^d$,
etc. The results of this section are summarized in Figure 1.
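The resemblance to strict positive definiteness mentioned above can be seen from a short computation. The following is a sketch under the assumption that $k$ is bounded and measurable, so that the embedding of $\mu$ is a well-defined element of $\mathcal{H}$ and the inner product can be exchanged with the integral; it is offered as an illustration and is not quoted from the statement of Proposition 8.

```latex
\begin{align*}
\Big\| \int_X k(\cdot,x)\, d\mu(x) \Big\|_{\mathcal{H}}^2
  &= \Big\langle \int_X k(\cdot,x)\, d\mu(x),\ \int_X k(\cdot,y)\, d\mu(y) \Big\rangle_{\mathcal{H}} \\
  &= \iint_X \langle k(\cdot,x), k(\cdot,y) \rangle_{\mathcal{H}}\, d\mu(x)\, d\mu(y)
   = \iint_X k(x,y)\, d\mu(x)\, d\mu(y).
\end{align*}
% Since the embedding is linear in \mu, it is injective on a vector space of
% measures iff the double integral is strictly positive for every nonzero \mu
% in that space. For \mu = \sum_{j=1}^n \alpha_j \delta_{x_j} the double
% integral is \sum_{l,j=1}^n \alpha_l \alpha_j k(x_l, x_j), the form in (5).
```

Strict positive definiteness only tests finitely supported $\mu$, which is why it is necessary for injectivity on the larger measure spaces considered here but need not be sufficient.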

Before characterizing various notions of universality, let us revisit their formal
definitions.

Definition 1 (c-universal) A continuous kernel $k$ on a compact Hausdorff space $X$ is
called c-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C(X)$ w.r.t. the
uniform norm, i.e., for every function $g \in C(X)$ and all $\epsilon > 0$, there exists
an $f \in \mathcal{H}$ such that $\|f - g\|_u \le \epsilon$.

Definition 2 (cc-universal) A continuous kernel $k$ on a Hausdorff space $X$ is said to be
cc-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C(X)$ endowed with the
topology of compact convergence, i.e., for any compact set $Z \subset X$, for any
$g \in C(Z)$ and all $\epsilon > 0$, there exists an $f \in \mathcal{H}|_Z$ such that
$\|f - g\|_u \le \epsilon$.

Definition 3 ($c_0$-universal) A bounded kernel, $k$ with $k(\cdot, x) \in C_0(X)$,
$\forall x \in X$ on a locally compact Hausdorff space, $X$ is said to be $c_0$-universal
if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C_0(X)$ w.r.t. the uniform norm,
i.e., for every function $g \in C_0(X)$ and all $\epsilon > 0$, there exists an
$f \in \mathcal{H}$ such that $\|f - g\|_u \le \epsilon$.

Definition 4 ($c_b$-universal) A bounded continuous kernel, $k$ on a topological space,
$X$, is said to be $c_b$-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in
$C_b(X)$ w.r.t. the uniform norm, i.e., for any $g \in C_b(X)$ and all $\epsilon > 0$,
there exists an $f \in \mathcal{H}$ such that $\|f - g\|_u \le \epsilon$.

First note that the above definitions are valid only if $\mathcal{H}$ is included in the
appropriate target space, i.e., $C(X)$ for c- and cc-universality, $C_0(X)$ for
$c_0$-universality, and $C_b(X)$ for $c_b$-universality. By Steinwart and Christmann
(2008, Lemma 4.28, Theorem 4.61), the assumptions made on the kernel in the above
definitions ensure that the definitions are valid. Also note that all these definitions
are equivalent when $X$ is compact, as $C_0(X) = C_b(X) = C(X)$ for compact $X$. When $X$
is not compact, it is easy to see that $c_b$-universality is stronger than
$c_0$-universality, i.e., if $k$ is $c_b$-universal, then it is also $c_0$-universal, but
not vice versa. On the other hand, it is not straightforward to see how the notions of
cc-universality and $c_0$-universality are related when $X$ is non-compact. By
characterizing $c_0$-universality and cc-universality, Theorem 6 in the following section
shows that the notion of $c_0$-universality is stronger than cc-universality, i.e., if a
kernel is $c_0$-universal, then it is cc-universal, but not vice versa. Based on these
results, it follows that $c_b$-universality is stronger than cc-universality (but not
vice versa) when $X$ is non-compact.

3.1 Main results 

Before we state our main result, i.e., Theorem 6, we need the following result, usually
referred to as the Hahn-Banach theorem, which we quote from Rudin (1991, Theorem 3.5)
(also see the remark following Theorem 3.5 in Rudin (1991)).

Theorem 5 (Hahn-Banach) Suppose $A$ is a subspace of a locally convex topological vector
space $Y$. Then $A$ is dense in $Y$ if and only if $A^\perp = \{0\}$, where

$$A^\perp := \{T \in Y' : \forall x \in A,\ T(x) = 0\}. \qquad (9)$$
The following main result of this paper, which presents a necessary and sufficient
condition for $k$ to be c-, cc-, $c_0$- or $c_b$-universal, hinges on the above theorem,
where we choose $A$ to be the RKHS, $\mathcal{H}$, and $Y$ to be $C(X)$, $C_0(X)$ or
$C_b(X)$, for which $Y'$ is known through the Riesz representation theorem.

Theorem 6 (Characterization of universal kernels) The following hold:

(a) Let $X$ be a compact Hausdorff space with $k$ being continuous. Then $k$ is
c-universal if and only if the embedding,

$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \quad \mu \in M_b(X), \qquad (10)$$

is injective.

(b) Let $X$ be an LCH space and $k \in C_b(X \times X)$. Then $k$ is cc-universal if and
only if the embedding,

$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \quad \mu \in M_{bc}(X), \qquad (11)$$

is injective.

(c) Let $X$ be an LCH space with the kernel, $k$ being bounded and
$k(\cdot, x) \in C_0(X)$, $\forall x \in X$. Then $k$ is $c_0$-universal if and only if
the embedding,

$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \quad \mu \in M_b(X), \qquad (12)$$

is injective.

(d) Let $X$ be a normal topological space and let $M_{rba}(X)$ be the space of all
finitely additive, regular, bounded set functions defined on the field generated by the
closed sets of $X$. Then, a bounded continuous kernel, $k$ is $c_b$-universal if and only
if the embedding,

$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \quad \mu \in M_{rba}(X), \qquad (13)$$

is injective.

Proof First, we prove (c), from which (a) follows.

(c) By Definition 3, $k$ is $c_0$-universal if $\mathcal{H}$ is dense in $C_0(X)$. We now invoke Theorem 5 to characterize the denseness of $\mathcal{H}$ in $C_0(X)$, which means we need to consider the dual $C_0'(X) := (C_0(X))'$ of $C_0(X)$. By the Riesz representation theorem (Folland, 1999, Theorem 7.17), $C_0'(X) = M_b(X)$ in the sense that there is a bijective linear isometry $\mu \mapsto T_\mu$ from $M_b(X)$ onto $C_0'(X)$, given by the natural mapping,

$$T_\mu(f) = \int_X f\,d\mu, \quad f \in C_0(X). \tag{14}$$

Therefore, by Theorem 5, $\mathcal{H}$ is dense in $C_0(X)$ if and only if

$$\mathcal{H}^\perp := \Big\{\mu \in M_b(X) : \forall f \in \mathcal{H},\ \int_X f\,d\mu = 0\Big\} = \{0\}. \tag{15}$$

($\Leftarrow$) Suppose (12) is injective, i.e., for $\mu \in M_b(X)$, $\int_X k(\cdot,x)\,d\mu(x) = 0 \Rightarrow \mu = 0$. Then by Lemma 26 (see Appendix A), we have

$$\int_X f\,d\mu = \Big\langle f, \int_X k(\cdot,x)\,d\mu(x)\Big\rangle_{\mathcal{H}} = 0,\ \forall f \in \mathcal{H} \ \Rightarrow\ \mu = 0,$$

which by (15) means $\mathcal{H}$ is dense in $C_0(X)$ and therefore $k$ is $c_0$-universal.

($\Rightarrow$) We need to prove that if $\mathcal{H}$ is dense in $C_0(X)$ then ($\int_X k(\cdot,x)\,d\mu(x) = 0 \Rightarrow \mu = 0$) holds. This is equivalent to showing that if ($\int_X k(\cdot,x)\,d\mu(x) = 0 \Rightarrow \mu = 0$) does not hold, then $\mathcal{H}$ is not dense in $C_0(X)$. Suppose ($\int_X k(\cdot,x)\,d\mu(x) = 0 \Rightarrow \mu = 0$) does not hold, i.e., $\exists\, 0 \neq \mu \in M_b(X)$ such that $\int_X k(\cdot,x)\,d\mu(x) = 0$, which means $\exists\, 0 \neq \mu \in M_b(X)$ such that $\int_X f\,d\mu = 0$ for every $f \in \mathcal{H}$. Then, by (15), $\mathcal{H}$ is not dense in $C_0(X)$.

(a) When $X$ is compact, $C_0(X)$ coincides with $C(X)$, which means $c$-universality and $c_0$-universality are equivalent. Therefore, $k$ is $c$-universal if and only if the embedding in (10) is injective.

(b) The proof is similar to that of (a) except that we need to consider the dual of $C(X)$ endowed with the topology of compact convergence (a locally convex topological vector space) to characterize the denseness of $\mathcal{H}$ in $C(X)$. It is known (Hewitt, 1950) that $C'(X) = M_{bc}(X)$ in the sense that there is a bijective linear isometry $\mu \mapsto T_\mu$ from $M_{bc}(X)$ onto $C'(X)$, given by the natural mapping, $T_\mu(f) = \int_X f\,d\mu$, $f \in C(X)$. The rest of the proof is verbatim with $M_b(X)$ replaced by $M_{bc}(X)$.

(d) The proof is very similar to that of (a), wherein we identify $(C_b(X))' \cong M_{rba}(X)$ such that $T \in (C_b(X))'$ and $\mu \in M_{rba}(X)$ satisfy $T(f) = \int_X f\,d\mu$, $f \in C_b(X)$ (Dunford and Schwartz, 1958, p. 262). Here, $\cong$ represents the isometric isomorphism. The rest of the proof is verbatim with $M_b(X)$ replaced by $M_{rba}(X)$. ■
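The equivalence in Theorem 6(a) can be checked by hand in the simplest possible setting. The following Python sketch assumes $X$ is a finite set of $n$ points (a compact Hausdorff space under the discrete topology), so that $C(X) = \mathbb{R}^n$, a finite signed measure is a weight vector $a \in \mathbb{R}^n$, the embedding restricted to $X$ is $a \mapsto Ka$ with $K$ the Gram matrix, and the RKHS is the column span of $K$. Both denseness and injectivity then reduce to $K$ having full rank. The points, kernels and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical finite domain X = {x_1, ..., x_n} (illustrative choice).
X = np.array([-1.0, -0.3, 0.2, 0.9, 1.7])

def gram(kernel, pts):
    """Gram matrix K[i, j] = kernel(pts[i], pts[j])."""
    return kernel(pts[:, None], pts[None, :])

gauss = lambda x, y: np.exp(-2.0 * (x - y) ** 2)   # Gaussian kernel
linear = lambda x, y: x * y                         # linear kernel

K_gauss = gram(gauss, X)
K_lin = gram(linear, X)

# On a finite X the embedding mu -> sum_j a_j k(., x_j), evaluated on X, is a -> K a.
# It is injective, and span{k(., x_j)} is all of R^n, exactly when K has rank n.
rank_gauss = np.linalg.matrix_rank(K_gauss)
rank_lin = np.linalg.matrix_rank(K_lin)

# A nonzero signed measure that the linear kernel maps to zero: sum_j a_j x_j = 0.
a = np.array([X[1], -X[0], 0.0, 0.0, 0.0])
embedding_of_a = K_lin @ a
```

In this toy setting the Gaussian Gram matrix has full rank, while the linear kernel has rank one and sends the nonzero measure `a` to the zero function, so it is not universal on this $X$.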

Theorem 6 can also be interpreted as: for appropriate assumptions on $X$ and $\mu$, the embedding in (2) is injective if and only if the kernel is universal, therefore relating universality and injective RKHS embedding of finite signed Radon measures. In other words, Theorem 6 provides a novel measure embedding view point of universality compared to its well-known function approximation view point. Based on Theorem 6, the following remarks can be made.

Remark 7 (a) Theorem 6 provides a necessary and sufficient condition for $c$-universality ($k$ is $c$-universal if and only if the embedding in (10) is injective), while Steinwart (2001) provided only a sufficient condition (in terms of the feature maps being an algebra; see Steinwart and Christmann (2008, Theorem 4.56) for details) using the Stone-Weierstrass theorem. Therefore, Theorem 6 differs from and generalizes the result by Steinwart (2001).

(b) Note that the embedding in (11) is injective if and only if for any compact set $Z \subset X$, the embedding

$$\mu \mapsto \int_Z k(\cdot,x)\,d\mu(x), \quad \mu \in M_b(Z), \tag{16}$$

is injective. Micchelli et al. (2006, Proposition 1) have shown that for any compact set $Z \subset X$, the embedding in (16) is injective if and only if the set $K(Z) = \mathrm{span}\{k(\cdot,y) : y \in Z\}$ is dense in $C(Z)$ w.r.t. the uniform norm. Therefore, it is clear that $k$ is $cc$-universal if and only if it is universal in the sense of Micchelli et al. See also Carmeli et al. (2009, Remark 1).

(c) By comparing the embeddings in (11) and (12), since $M_{bc}(X) \subset M_b(X)$, it is clear that $c_0$-universality is stronger than $cc$-universality, i.e., if a kernel is $c_0$-universal (satisfies (12)), then it is $cc$-universal (satisfies (11)). In general, the converse is not true (see Proposition 11 and Example 1). However, we will show these notions to be equivalent in the case of radial kernels on $\mathbb{R}^d$ (see Proposition 16).

(d) Carmeli et al. (2009, Theorems 2, 4) provided characterizations for $c_0$- and $cc$-universality in terms of the injectivity of an integral operator on the space of square-integrable functions, whereas our characterizations in Theorem 6 deal with the injectivity of an embedding that maps finite signed Radon measures into an RKHS, $\mathcal{H}$. Since the latter can be seen as a generalization of the embedding in (1) that deals with characteristic kernels, our characterizations can be used in a straightforward way to relate universal and characteristic kernels (see Section 4 for details).

(e) Note that $M_{rba}(X)$ in (13) does not contain any measure (though a set function in $M_{rba}(X)$ can be extended to a measure), as measures are countably additive and defined on a $\sigma$-field. Since $\mu$ in Theorem 6(d) is not a measure but a finitely additive set function defined on a field, it is not clear how to deal with the integral in (13). Because of the technicalities involved in dealing with set functions, we do not further pursue the notion of $c_b$-universality in this paper.

Based on Theorem 6, the following result provides an alternate and equivalent characterization of universality or injectivity of the embedding in (2), which is easier to interpret, as it resembles the condition of $k$ being strictly pd (though not quite exactly the same). This alternate characterization is then used in Sections 3.2-3.4 to obtain easily checkable conditions for the universality of specific classes of kernels. We also show that strictly pd is a necessary condition for universality.

Proposition 8 Suppose the assumptions in Theorem 6 hold. Then,

(a) $k$ is $c$-universal if and only if

$$\iint_X k(x,y)\,d\mu(x)\,d\mu(y) > 0, \quad \forall\, 0 \neq \mu \in M_b(X). \tag{17}$$

(b) $k$ is $cc$-universal if and only if

$$\iint_X k(x,y)\,d\mu(x)\,d\mu(y) > 0, \quad \forall\, 0 \neq \mu \in M_{bc}(X). \tag{18}$$

(c) $k$ is $c_0$-universal if and only if

$$\iint_X k(x,y)\,d\mu(x)\,d\mu(y) > 0, \quad \forall\, 0 \neq \mu \in M_b(X). \tag{19}$$

(d) If $k$ is $c$-, $cc$- or $c_0$-universal, then it is strictly pd.
Proof We only prove (c). The proof of (b) is exactly the same as that of (c) with $M_b(X)$ replaced by $M_{bc}(X)$, while the proof of (a) is trivial.

(c) ($\Leftarrow$) Suppose $k$ is not $c_0$-universal. By Theorem 6(c), there exists $0 \neq \mu \in M_b(X)$ such that $\int_X k(\cdot,x)\,d\mu(x) = 0$, which implies $\|\int_X k(\cdot,x)\,d\mu(x)\|_{\mathcal{H}} = 0$. This means

$$0 = \Big\langle \int_X k(\cdot,x)\,d\mu(x), \int_X k(\cdot,x)\,d\mu(x)\Big\rangle_{\mathcal{H}} \overset{(e)}{=} \iint_X k(x,y)\,d\mu(x)\,d\mu(y),$$

where (e) follows from Lemma 26 (see Appendix A). By our assumption in (19), this leads to a contradiction. Therefore, if (19) holds, then $k$ is $c_0$-universal.

($\Rightarrow$) Suppose there exists $0 \neq \mu \in M_b(X)$ such that $\iint_X k(x,y)\,d\mu(x)\,d\mu(y) = 0$, i.e., $\|\int_X k(\cdot,x)\,d\mu(x)\|_{\mathcal{H}} = 0$, which implies $\int_X k(\cdot,x)\,d\mu(x) = 0$. Therefore, the embedding in (12) is not injective, which by Theorem 6(c) implies that $k$ is not $c_0$-universal. Therefore, if $k$ is $c_0$-universal, then $k$ satisfies (19).

(d) Suppose $k$ is not strictly pd. This means for some $n \in \mathbb{N}$ and for mutually distinct $x_1, \ldots, x_n \in X$, there exists $\mathbb{R} \ni \alpha_j \neq 0$ for some $j \in \{1, \ldots, n\}$ such that

$$\sum_{i,j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) = 0. \tag{20}$$

Define $\mu := \sum_{j=1}^{n} \alpha_j \delta_{x_j}$, where $\delta_x$ represents the Dirac measure at $x$. Clearly $\mu \neq 0$ and $\mu \in M_{bc}(X)$. From (20), it is clear that $\iint_X k(x,y)\,d\mu(x)\,d\mu(y) = 0$. Therefore, by Proposition 8(b), $k$ is not $cc$-universal. The result for $c_0$-universality follows from Remark 7(c), while the result for $c$-universality is trivial. See Carmeli et al. (2009, Corollary 5), Steinwart and Christmann (2008, Proposition 4.54, Example 4.11) and Sriperumbudur et al. (2009b, Footnote 4). ■
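The argument in part (d) is easy to reproduce numerically: for a discrete measure $\mu = \sum_j \alpha_j \delta_{x_j}$ the double integral in (17)-(19) is the quadratic form $\alpha^T K \alpha$ appearing in (20). The sketch below, a minimal illustration assuming NumPy and an arbitrary choice of points, uses a degree-2 polynomial kernel (whose feature space is 3-dimensional, so any 4-point Gram matrix is singular) as an example of a kernel that is not strictly pd, and contrasts it with a Gaussian kernel. The kernels and helper names are assumptions made for illustration.

```python
import numpy as np

pts = np.array([-1.5, -0.4, 0.6, 2.0])   # mutually distinct points (illustrative)

def quad_form(K, alpha):
    """Double integral of k against mu = sum_j alpha_j delta_{x_j}: alpha^T K alpha."""
    return float(alpha @ K @ alpha)

# Degree-2 polynomial kernel k(x, y) = (1 + x y)^2 on the real line.
K_poly = (1.0 + np.outer(pts, pts)) ** 2
# The right singular vector of the smallest singular value is a nonzero alpha
# with K alpha ~ 0, i.e. a witness for (20).
_, svals, Vt = np.linalg.svd(K_poly)
alpha = Vt[-1]
q_poly = quad_form(K_poly, alpha)

# Gaussian kernel on the same points: the smallest eigenvalue is strictly positive,
# so the quadratic form is positive for every nonzero alpha on these points.
K_gauss = np.exp(-(pts[:, None] - pts[None, :]) ** 2)
min_eig_gauss = float(np.linalg.eigvalsh(K_gauss).min())
q_gauss = quad_form(K_gauss, alpha)
```

The vanishing quadratic form for the polynomial kernel exhibits a nonzero compactly supported measure violating (18), so by Proposition 8(b) that kernel is not $cc$-universal. A positive smallest eigenvalue on one point set is of course only consistent with, not a proof of, strict positive definiteness of the Gaussian kernel.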



Remark 9 (a) Although the conditions in (17)-(19) resemble the strictly pd condition, they are not equivalent. By combining any of (a)-(c) with (d) in Proposition 8, it is easy to see that if $k$ satisfies any of (17)-(19), then it is strictly pd. However, the converse is not true (see Remark 12(a) and the discussion following Example 2; also refer to Steinwart and Christmann (2008, Proposition 4.60 and the associated theorem) for the related discussion). We show in Section 3.4 that in the case of radial kernels on $\mathbb{R}^d$, the converse is true, i.e., $k$ being strictly pd is also sufficient for it to be $cc$- or $c_0$-universal (see Proposition 16).

(b) The condition on $k$ in (19) can be seen as a generalization of integrally strictly pd kernels (Stewart, 1976, Section 6): $\iint_{\mathbb{R}^d} k(x,y) f(x) f(y)\,dx\,dy > 0$ for all $f \in L^2(\mathbb{R}^d)$, which is the strict positive definiteness of the integral operator given by the kernel.

A summary of results based on Theorem 6, Remark 9 and Proposition 8 is shown in Figures 1(a) and 1(b).

Although the conditions in (17)-(19) are easy to interpret, they are not always easy to check. To this end, in the remainder of this section, we present easily checkable characterizations for the following classes of kernels. These classes of kernels are both mathematically and practically interesting as many of the popular kernels used in machine learning, e.g., Gaussian, Laplacian, exponential, etc., fall in these classes (see Examples 1-3 for more examples).
$(A_1)$ $k$ is translation invariant on $\mathbb{R}^d \times \mathbb{R}^d$, i.e., $k(x,y) = \psi(x-y)$, where $0 \neq \psi \in C_b(\mathbb{R}^d)$ is a pd function on $\mathbb{R}^d$ (see footnote 3).

$(A_2)$ Fourier kernel: $k$ is translation invariant on $\mathbb{T}^d \times \mathbb{T}^d$, where $\mathbb{T}^d := [0, 2\pi)^d$, the $d$-Torus, i.e., $k(x,y) = \psi((x-y) \bmod 2\pi)$, where $\psi \in C(\mathbb{T}^d)$ is a pd function on $\mathbb{T}^d$.

$(A_3)$ $k$ is a radial kernel on $\mathbb{R}^d \times \mathbb{R}^d$, i.e., there exists a finite nonnegative Borel measure, $\nu$ on $[0, \infty)$ such that for all $x, y \in \mathbb{R}^d$,

$$k(x,y) = \int_{[0,\infty)} e^{-t\|x-y\|_2^2}\,d\nu(t). \tag{21}$$

These kernels are also called Schoenberg kernels (Wendland, 2005, Corollary 7.12, Theorem 7.13) (see footnote 4).

$(A_4)$ $X$ is an LCH space with bounded $k$. Let $k(x,y) = \sum_{j \in I} \phi_j(x)\phi_j(y)$, $(x,y) \in X \times X$, where we assume the series converges uniformly on $X \times X$. $\{\phi_j : j \in I\}$ is a set of continuous real-valued functions on $X$, where $I$ is a countable index set.

3.2 Translation invariant kernels on $\mathbb{R}^d$: $(A_1)$

The following result provides an easily checkable characterization for $k$ to be $c_0$-universal or $cc$-universal (we do not consider $c$-universality as $X = \mathbb{R}^d$ is not compact) when $k$ is translation invariant on $\mathbb{R}^d$, i.e., when $k$ satisfies $(A_1)$. Before we present the result, we need a theorem due to Bochner that characterizes translation invariant kernels on $\mathbb{R}^d$, which is quoted from Wendland (2005, Theorem 6.6).

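Theorem 10 (Bochner) $\psi \in C_b(\mathbb{R}^d)$ is pd on $\mathbb{R}^d$ if and only if it is the Fourier transform of a finite nonnegative Borel measure $\Lambda$ on $\mathbb{R}^d$, i.e.,

$$\psi(x) = \int_{\mathbb{R}^d} e^{-ix^T\omega}\,d\Lambda(\omega), \quad x \in \mathbb{R}^d. \tag{22}$$

Proposition 11 (Translation invariant kernels on $\mathbb{R}^d$) Suppose $(A_1)$ holds.

(a) Let $\psi \in C_0(\mathbb{R}^d)$. Then $k$ is $c_0$-universal if and only if $\mathrm{supp}(\Lambda) = \mathbb{R}^d$ (see footnote 5).

(b) If $\mathrm{supp}(\psi)$ is compact, then $k$ is $c_0$-universal.

(c) If $(\mathrm{supp}(\Lambda))^\circ \neq \emptyset$, then $k$ is $cc$-universal.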

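Proof (a) ($\Leftarrow$) Consider $\iint_{\mathbb{R}^d} k(x,y)\,d\mu(x)\,d\mu(y)$ for any $0 \neq \mu \in M_b(\mathbb{R}^d)$ with $k(x,y) = \psi(x-y)$.

$$\begin{aligned}
B := \iint_{\mathbb{R}^d} k(x,y)\,d\mu(x)\,d\mu(y) &= \iint_{\mathbb{R}^d} \psi(x-y)\,d\mu(x)\,d\mu(y) \\
&\overset{(d)}{=} \iiint_{\mathbb{R}^d} e^{-i(x-y)^T\omega}\,d\Lambda(\omega)\,d\mu(x)\,d\mu(y) \\
&\overset{(e)}{=} \int_{\mathbb{R}^d} \Big[\int_{\mathbb{R}^d} e^{-ix^T\omega}\,d\mu(x)\Big]\Big[\int_{\mathbb{R}^d} e^{iy^T\omega}\,d\mu(y)\Big]\,d\Lambda(\omega) \\
&\overset{(f)}{=} \int_{\mathbb{R}^d} \hat{\mu}(\omega)\,\overline{\hat{\mu}(\omega)}\,d\Lambda(\omega) = \int_{\mathbb{R}^d} |\hat{\mu}(\omega)|^2\,d\Lambda(\omega),
\end{aligned} \tag{23}$$

where Theorem 10 is invoked in (d), Fubini's theorem (Folland, 1999, Theorem 2.37) in (e) and (6) in (f). If $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, then it is clear that $B > 0$. Therefore, by Proposition 8(c), $k$ is $c_0$-universal.

3. $\psi$ is said to be a pd function on $\mathbb{R}^d$ if $k(x,y) = \psi(x-y)$ is pd.

4. Note that $k$ is a scale mixture of Gaussian kernels.

5. See (4) for the definition of the support of a Borel measure.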

($\Rightarrow$) Suppose $k$ is $c_0$-universal, which by Theorem 6(c) means that $\mu \mapsto \int_{\mathbb{R}^d} k(\cdot,x)\,d\mu(x)$ is injective for $\mu \in M_b(\mathbb{R}^d)$. This means $\mu \mapsto \int_{\mathbb{R}^d} k(\cdot,x)\,d\mu(x)$ is injective for $\mu \in M_+^1(\mathbb{R}^d)$ and therefore Theorem 7 in Sriperumbudur et al. (2008) yields $\mathrm{supp}(\Lambda) = \mathbb{R}^d$.

(b) The proof is the same as that of Corollary 10 in Sriperumbudur et al. (2009b). Since $\mathrm{supp}(\psi)$ is compact in $\mathbb{R}^d$, by the Paley-Wiener theorem (Rudin, 1991, Theorem 7.23), we deduce that $\mathrm{supp}(\Lambda) = \mathbb{R}^d$. Therefore, the result follows from Proposition 11(a).

(c) Consider $\iint_{\mathbb{R}^d} k(x,y)\,d\mu(x)\,d\mu(y)$ with $k(x,y) = \psi(x-y)$ and $\mu \in M_{bc}(\mathbb{R}^d)$. Since (23) holds for any $\mu \in M_b(\mathbb{R}^d)$, it also holds for any $\mu \in M_{bc}(\mathbb{R}^d)$, i.e.,

$$B := \iint_{\mathbb{R}^d} k(x,y)\,d\mu(x)\,d\mu(y) = \int_{\mathbb{R}^d} |\hat{\mu}(\omega)|^2\,d\Lambda(\omega).$$

Since $\mu \in M_{bc}(\mathbb{R}^d)$, by the Paley-Wiener theorem (Rudin, 1991, Theorem 7.23), we obtain that $\hat{\mu}$ cannot vanish over an open set in $\mathbb{R}^d$ and $\mathrm{supp}(\hat{\mu}) = \mathbb{R}^d$. Therefore, if $(\mathrm{supp}(\Lambda))^\circ \neq \emptyset$, then $B > 0$ for every $0 \neq \mu \in M_{bc}(\mathbb{R}^d)$ and the result follows from Proposition 8(b). ■
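The identity (23) that drives this proof can be verified numerically for a discrete signed measure. The sketch below assumes $d = 1$, a Gaussian $\psi(x) = \exp(-x^2/(2\sigma^2))$, for which $d\Lambda(\omega) = \frac{\sigma}{\sqrt{2\pi}} e^{-\sigma^2\omega^2/2}\,d\omega$ satisfies (22), and the convention $\hat{\mu}(\omega) = \int e^{-i\omega x}\,d\mu(x)$ used in step (e) of (23). The support points, weights and quadrature grid are illustrative assumptions.

```python
import numpy as np

sigma = 1.0
x = np.array([-0.7, 0.1, 0.8, 1.9])      # support points of mu (illustrative)
a = np.array([1.0, -2.0, 0.5, 0.7])      # signed weights of mu (illustrative)

# Left-hand side of (23): sum_{i,j} a_i a_j psi(x_i - x_j).
psi = lambda t: np.exp(-t ** 2 / (2 * sigma ** 2))
lhs = float(a @ psi(x[:, None] - x[None, :]) @ a)

# Right-hand side of (23): integral of |mu_hat(w)|^2 against dLambda(w).
w = np.linspace(-12.0, 12.0, 48001)
mu_hat = (a[None, :] * np.exp(-1j * np.outer(w, x))).sum(axis=1)
density = sigma / np.sqrt(2 * np.pi) * np.exp(-(sigma * w) ** 2 / 2)
integrand = np.abs(mu_hat) ** 2 * density
# Trapezoid rule on the truncated frequency axis.
rhs = float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(w)) / 2)
```

Because the Gaussian spectral density is positive everywhere, the right-hand side can only vanish if $\hat{\mu}$ vanishes almost everywhere, which is the mechanism behind Proposition 11(a).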

Proposition 11 can easily be extended to locally compact Abelian groups by using the ideas in Fukumizu et al. (2009b). Note that Proposition 11(c) matches with Proposition 15 in Micchelli et al. (2006), which is not surprising (see Remark 7(b)). Based on Proposition 11, in the following, we provide some examples of $c_0$- and $cc$-universal kernels that are translation invariant kernels on $\mathbb{R}^d$.

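Example 1 Let $d\Lambda(\omega) = (2\pi)^{-d/2}\,\hat{\psi}(\omega)\,d\omega$. Note that $\mathrm{supp}(\Lambda) = \mathrm{supp}(\hat{\psi})$. The following kernels satisfy $\mathrm{supp}(\hat{\psi}) = \mathbb{R}^d$ and therefore are both $c_0$- and $cc$-universal.

(1) Gaussian, $\psi(x) = \exp\big(-\frac{\|x\|_2^2}{2\sigma^2}\big)$, $\sigma > 0$ with $\hat{\psi}(\omega) = \sigma^d \exp\big(-\frac{\sigma^2\|\omega\|_2^2}{2}\big)$.

(2) Laplacian, $\psi(x) = \exp(-\sigma\|x\|_1)$, $\sigma > 0$ with $\hat{\psi}(\omega) = \big(\frac{2}{\pi}\big)^{d/2} \prod_{j=1}^{d} \frac{\sigma}{\sigma^2 + \omega_j^2}$, where $\omega = (\omega_1, \ldots, \omega_d)$.

(3) $B_1$-spline, $\psi(x) = \prod_{j=1}^{d} (1 - |x_j|)\,\mathbb{1}_{[-1,1]}(x_j)$ with $\hat{\psi}(\omega) = \frac{4^d}{(2\pi)^{d/2}} \prod_{j=1}^{d} \frac{\sin^2(\omega_j/2)}{\omega_j^2}$, where $x = (x_1, \ldots, x_d)$ and $\omega = (\omega_1, \ldots, \omega_d)$.

The following are some examples of translation invariant kernels on $\mathbb{R}^d$ that are not $c_0$-universal but $cc$-universal. These kernels satisfy $\mathrm{supp}(\hat{\psi}) \subsetneq \mathbb{R}^d$ and $(\mathrm{supp}(\hat{\psi}))^\circ \neq \emptyset$.

(4) Sinc kernel, $\psi(x) = \prod_{j=1}^{d} \frac{\sin(\sigma x_j)}{x_j}$, $\sigma > 0$ with $\hat{\psi}(\omega) = \big(\frac{\pi}{2}\big)^{d/2} \prod_{j=1}^{d} \mathbb{1}_{[-\sigma,\sigma]}(\omega_j)$ and $\mathrm{supp}(\hat{\psi}) = [-\sigma, \sigma]^d \subsetneq \mathbb{R}^d$.

(5) Sinc-squared kernel, $\psi(x) = \prod_{j=1}^{d} \frac{\sin^2(x_j/2)}{x_j^2}$ with $\hat{\psi}(\omega) = \frac{(2\pi)^{d/2}}{4^d} \prod_{j=1}^{d} (1 - |\omega_j|)\,\mathbb{1}_{[-1,1]}(\omega_j)$ and $\mathrm{supp}(\hat{\psi}) = [-1, 1]^d \subsetneq \mathbb{R}^d$.

The following remarks can be made about Proposition 11.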

Remark 12 (a) By a result in Wendland (2005, Chapter 6), if $(\mathrm{supp}(\Lambda))^\circ \neq \emptyset$, then $k(x,y) = \psi(x-y)$ is strictly pd. By Proposition 11(a), this means a strictly pd kernel need not be $c_0$-universal and therefore need not satisfy the condition in (19), i.e., strictly pd is not a sufficient condition for (19) to hold (see Remark 9(a)). As an example, a sinc-squared kernel is strictly pd but not $c_0$-universal (see Example 1).

(b) In Proposition 8(d), we have shown that strictly pd is a necessary condition for a kernel to be $c_0$- or $cc$-universal. From the above remark, it is clear that $k$ being strictly pd does not imply it is $c_0$-universal. But does it imply $k$ is $cc$-universal? In general, it is not clear whether this is true. However, if $\psi \in C_b(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$ is strictly pd, then $k(x,y) = \psi(x-y)$ is $cc$-universal. This follows from Wendland (2005, Theorem 6.11, Corollary 6.12): if $\psi \in C_b(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$ is strictly pd, then $0 \neq \hat{\psi} \in L^1(\mathbb{R}^d)$, $\hat{\psi} \geq 0$ and $(\mathrm{supp}(\hat{\psi}))^\circ \neq \emptyset$, which by Proposition 11(c) implies $k$ is $cc$-universal.

(c) Is the converse to Proposition 11(c) true? I.e., if $k$ is $cc$-universal, then does $(\mathrm{supp}(\Lambda))^\circ \neq \emptyset$ hold? Let $X = \mathbb{R}$. Suppose $(\mathrm{supp}(\Lambda))^\circ = \emptyset$, which means $\mathrm{supp}(\Lambda)$ is of the form $\{0, \pm\omega_1, \pm\omega_2, \ldots\}$, where $0 \neq \omega_j \in \mathbb{R}$ for all $j$. Let us assume that there exists a non-zero entire function, $\hat{h}$ on $\mathbb{C}$ that satisfies (i) $\hat{h}(\omega_j) = 0$, $\forall j$ and (ii) for each $N \in \mathbb{N}$, there is a $C_N$ such that

$$|\hat{h}(\zeta)| \leq \frac{C_N\, e^{R|\mathrm{Im}\,\zeta|}}{(1 + |\zeta|)^N}$$

for all $\zeta \in \mathbb{C}$ and some $R > 0$. Here $\mathrm{Im}\,\zeta$ represents the imaginary part of $\zeta$. By the Paley-Wiener theorem (Reed and Simon, 1972, Theorem IX.11, p. 16), $h \in C_0(\mathbb{R})$ is an infinitely differentiable function on $\mathbb{R}$ and $\mathrm{supp}(h) \subset \{x \in \mathbb{R} : |x| \leq R\}$. Define $d\mu(x) = h(x)\,dx$. It is easy to check that

$$\iint_{\mathbb{R}} k(x,y)\,d\mu(x)\,d\mu(y) = \iint_{\mathbb{R}} k(x,y)h(x)h(y)\,dx\,dy = 2\pi \int_{\mathbb{R}} |\hat{h}(\omega)|^2\,d\Lambda(\omega) = 2\pi \sum_j |\hat{h}(\omega_j)|^2\,\Lambda(\{\omega_j\}) = 0.$$

This means there exists $0 \neq \mu \in M_{bc}(\mathbb{R})$ such that $\iint_{\mathbb{R}} k(x,y)\,d\mu(x)\,d\mu(y) = 0$, which means $k$ is not $cc$-universal, by Proposition 8(b). Therefore, if $k$ is $cc$-universal, then $(\mathrm{supp}(\Lambda))^\circ \neq \emptyset$, under the assumption that there exists an $\hat{h}$ that satisfies (i) and (ii) shown above. The construction of such an $\hat{h}$ is not straightforward for any $k$, and therefore it is not clear whether the above converse is true in general.

On the other hand, Sriperumbudur et al. (2009b, Example 5) have shown that if $k$ is a periodic kernel (these kernels satisfy $(\mathrm{supp}(\Lambda))^\circ = \emptyset$), then such an $h$ defined on $\mathbb{R}$ can be constructed. This means if $k$ is $cc$-universal on $\mathbb{R}$, then it is not periodic on $\mathbb{R}$. However, this does not rule out the case of $k$ being $cc$-universal but aperiodic such that $(\mathrm{supp}(\Lambda))^\circ = \emptyset$.

A summary of results, based on Proposition 11 and Remark 12, for the case of kernels satisfying $(A_1)$, is shown in Figure 1(c).
3.3 Translation invariant kernels on $\mathbb{T}^d$: $(A_2)$

First note that since $\mathbb{T}^d$ is a compact metric space, the notions of $c$-universality, $cc$-universality and $c_0$-universality are equivalent. Steinwart (2001, Corollary 11) provided a sufficient condition for a Fourier kernel to be $c$-universal. In Proposition 14, we show that this condition is also necessary. Using this result, we then show that the converse to Proposition 8(d) is not true. Before we present the result on the characterization of $c$-universality of kernels in $(A_2)$, we state Bochner's theorem that characterizes pd functions, $\psi$ on $\mathbb{T}^d$.

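Theorem 13 (Bochner) $\psi \in C(\mathbb{T}^d)$ is pd if and only if

$$\psi(x) = \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, e^{ix^T n}, \quad x \in \mathbb{T}^d, \tag{24}$$

where $A_\psi : \mathbb{Z}^d \to \mathbb{R}_+$, $A_\psi(-n) = A_\psi(n)$ and $\sum_{n \in \mathbb{Z}^d} A_\psi(n) < \infty$. $A_\psi$ are called the Fourier series coefficients of $\psi$.

Proposition 14 (Translation invariant kernels on $\mathbb{T}^d$) Suppose $(A_2)$ holds. Then, $k$ is $c$-universal if and only if $A_\psi(n) > 0$, $\forall n \in \mathbb{Z}^d$.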

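Proof ($\Leftarrow$) Consider $\iint_{\mathbb{T}^d} k(x,y)\,d\mu(x)\,d\mu(y)$ for $0 \neq \mu \in M_b(\mathbb{T}^d)$. Substituting for $k$ as in $(A_2)$ and for $\psi$ as in (24), we have

$$\begin{aligned}
B := \iint_{\mathbb{T}^d} k(x,y)\,d\mu(x)\,d\mu(y) &= \iint_{\mathbb{T}^d} \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, e^{i(x-y)^T n}\,d\mu(x)\,d\mu(y) \\
&\overset{(a)}{=} \sum_{n \in \mathbb{Z}^d} A_\psi(n) \int_{\mathbb{T}^d} e^{ix^T n}\,d\mu(x) \int_{\mathbb{T}^d} e^{-iy^T n}\,d\mu(y) \\
&\overset{(b)}{=} (2\pi)^{2d} \sum_{n \in \mathbb{Z}^d} A_\psi(n)\,|A_\mu(n)|^2,
\end{aligned} \tag{25}$$

where Fubini's theorem is invoked in (a) and

$$A_\mu(n) := (2\pi)^{-d} \int_{\mathbb{T}^d} e^{-in^T x}\,d\mu(x), \quad n \in \mathbb{Z}^d, \tag{26}$$

is used in (b). Note that $A_\mu$ is the Fourier transform of $\mu$ in $\mathbb{T}^d$. Since $A_\psi(n) > 0$, $\forall n \in \mathbb{Z}^d$, we have $B > 0$, which by Proposition 8(a) implies $k$ is $c$-universal.

($\Rightarrow$) Proving necessity is equivalent to proving that if $A_\psi(n) = 0$ for some $n = n_0$, then there exists $0 \neq \mu \in M_b(\mathbb{T}^d)$ such that $\iint_{\mathbb{T}^d} k(x,y)\,d\mu(x)\,d\mu(y) = 0$.

Let $A_\psi(n) = 0$ for some $n = n_0$. Define $d\mu(x) = 2\alpha\cos(x^T n_0)\,dx$, $\alpha \in \mathbb{R}\setminus\{0\}$. By (26), we get $A_\mu(n) = \alpha\,(\delta_{n_0}(n) + \delta_{-n_0}(n))$, where $\delta$ represents the Kronecker delta. This means $\mu \neq 0$. Using $A_\psi(n_0) = A_\psi(-n_0) = 0$ and $A_\mu$ in (25), it is easy to check that $\iint_{\mathbb{T}^d} k(x,y)\,d\mu(x)\,d\mu(y) = 0$. Therefore, $k$ is not $c$-universal. ■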
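The necessity argument can be reproduced numerically on $\mathbb{T}$ ($d = 1$). The sketch below, a minimal illustration assuming NumPy and a uniform quadrature grid on $[0, 2\pi)$, takes $d\mu(x) = 2\alpha\cos(n_0 x)\,dx$ with $n_0 = 3$ and compares a Dirichlet kernel with $l = 2$ (so that $A_\psi(\pm 3) = 0$) against a Poisson kernel with $A_\psi(n) = \sigma^{|n|} > 0$. For the Poisson kernel, (25) predicts $B = (2\pi)^2 \alpha^2 (A_\psi(n_0) + A_\psi(-n_0)) = 8\pi^2\alpha^2\sigma^{n_0}$. The grid size and parameter values are illustrative assumptions.

```python
import numpy as np

N = 512                                   # quadrature nodes on [0, 2*pi)
x = 2 * np.pi * np.arange(N) / N
h = 2 * np.pi / N

def dirichlet(t, l=2):
    """Dirichlet kernel via its Fourier series: sum_{|n| <= l} exp(i n t)."""
    return 1.0 + 2.0 * sum(np.cos(n * t) for n in range(1, l + 1))

def poisson(t, s=0.5):
    """Poisson kernel, whose Fourier coefficients are s^{|n|}."""
    return (1 - s ** 2) / (s ** 2 - 2 * s * np.cos(t) + 1)

def double_integral(psi, density):
    """Approximate int int psi(x - y) density(x) density(y) dx dy on the grid."""
    D = x[:, None] - x[None, :]
    return float(density @ psi(D) @ density) * h * h

n0, alpha = 3, 1.0                        # n0 lies outside {0, +-1, +-2}
density = 2 * alpha * np.cos(n0 * x)      # d mu(x) = 2 alpha cos(n0 x) dx

B_dirichlet = double_integral(dirichlet, density)
B_poisson = double_integral(poisson, density)
B_poisson_predicted = 8 * np.pi ** 2 * alpha ** 2 * 0.5 ** n0
```

The Dirichlet kernel yields a vanishing double integral for this nonzero $\mu$, so it is not $c$-universal, whereas the Poisson kernel gives the strictly positive value predicted by (25).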

Note that Proposition 14 provides an easy to check condition for the $c$-universality of translation invariant kernels on $\mathbb{T}^d$.

Example 2 The following are some examples of translation invariant kernels on $\mathbb{T}$ that are $c$-universal (and therefore $c_0$-universal and $cc$-universal).

(1) Poisson kernel, $\psi(x) = \frac{1 - \sigma^2}{\sigma^2 - 2\sigma\cos x + 1}$, $0 < \sigma < 1$ with $A_\psi(n) = \sigma^{|n|}$, $n \in \mathbb{Z}$.

(2) $\psi(x) = e^{\alpha\cos x}\cos(\alpha\sin x)$, $0 < \alpha < 1$ with $A_\psi(0) = 1$ and $A_\psi(n) = \frac{\alpha^{|n|}}{2\,|n|!}$, $\forall n \neq 0$.

(3) $\psi(x) = (\pi - (x) \bmod 2\pi)^2$ with $A_\psi(0) = \frac{\pi^2}{3}$ and $A_\psi(n) = \frac{2}{n^2}$, $\forall n \neq 0$.

Some examples of translation invariant kernels on $\mathbb{T}$ that are not $c$-universal (and therefore not $c_0$-universal and not $cc$-universal) are:

(4) Dirichlet kernel, $\psi(x) = \frac{\sin\frac{(2l+1)x}{2}}{\sin\frac{x}{2}}$, $l \in \mathbb{N}$ with $A_\psi(n) = 1$ for $n \in \{0, \pm 1, \ldots, \pm l\} =: D$ and $A_\psi(n) = 0$ for $n \notin D$.

(5) Fejér kernel, $\psi(x) = \frac{1}{l+1}\,\frac{\sin^2\frac{(l+1)x}{2}}{\sin^2\frac{x}{2}}$, $l \in \mathbb{N}$ with $A_\psi(n) = 1 - \frac{|n|}{l+1}$ for $n \in D$ and $A_\psi(n) = 0$ for $n \notin D$.

$c$-universal kernels vs. Strictly pd kernels: We have shown in Proposition 8(d) that strictly pd is a necessary condition for $k$ to be $c$-, $cc$- or $c_0$-universal. However, the converse is not true (see Remark 9(a)), which is based on Proposition 14 and the following result in Theorem 15. Before we state the result, we need some definitions.

For natural numbers $m$ and $n$ and a set $A$ of integers, $m + nA := \{j \in \mathbb{Z} \mid j = m + na,\ a \in A\}$. An increasing sequence $\{c_l\}$ of nonnegative integers is said to be prime if it is not contained in any set of the form $p_1\mathbb{N} \cup p_2\mathbb{N} \cup \cdots \cup p_n\mathbb{N}$, where $p_1, p_2, \ldots, p_n$ are prime numbers. Any infinite increasing sequence of prime numbers is a trivial example of a prime sequence. We write $\mathbb{N}_n^0 := \{0, 1, \ldots, n\}$.

Theorem 15 (Menegatto, 1995) Let $\psi$ be a pd function on $\mathbb{T}$ of the form in (24). Let $N := \{|n| : A_\psi(n) > 0,\ n \in \mathbb{Z}\} \subset \mathbb{N} \cup \{0\}$. Then $\psi$ is strictly pd if $N$ has a subset of the form $\bigcup_{l=0}^{\infty} (b_l + c_l\mathbb{N}_l^0)$, in which $\{b_l\} \cup \{c_l\} \subset \mathbb{N}$ and $\{c_l\}$ is a prime sequence.

Suppose $\psi$ is such that $N \subsetneq \mathbb{N} \cup \{0\}$ has a subset of the form mentioned in Theorem 15. Clearly, $\psi$ is strictly pd. However, it is not $c$-universal, as Proposition 14 states that $k$ is $c$-universal if and only if $N = \mathbb{N} \cup \{0\}$.

A summary of results for kernels of the type $(A_2)$ is shown in Figure 1(b).

3.4 Radial kernels on $\mathbb{R}^d$: $(A_3)$

The following result provides an easily checkable characterization for $k$ to be $c_0$- and $cc$-universal ($c$-universality is not considered as $X = \mathbb{R}^d$ is not compact) when $k$ satisfies $(A_3)$.

Proposition 16 (Radial kernels on $\mathbb{R}^d$) Suppose $(A_3)$ holds. Then the following conditions are equivalent.

(a) $k$ is $c_0$-universal.

(b) $\mathrm{supp}(\nu) \neq \{0\}$.

(c) $k$ is strictly pd.

(d) $k$ is $cc$-universal.



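Proof (a) $\Rightarrow$ (d) by Remark 7(c), (d) $\Rightarrow$ (c) by Proposition 8(d) and (c) $\Leftrightarrow$ (b) by Wendland (2005, Theorem 7.14). Now, we show (b) $\Rightarrow$ (a).

Consider $\iint_{\mathbb{R}^d} k(x,y)\,d\mu(x)\,d\mu(y)$ with $k$ as in (21), given by

$$\begin{aligned}
B := \iint_{\mathbb{R}^d} k(x,y)\,d\mu(x)\,d\mu(y) &= \iint_{\mathbb{R}^d} \int_0^\infty e^{-t\|x-y\|_2^2}\,d\nu(t)\,d\mu(x)\,d\mu(y) \\
&\overset{(e)}{=} \int_0^\infty \Big[\iint_{\mathbb{R}^d} e^{-t\|x-y\|_2^2}\,d\mu(x)\,d\mu(y)\Big]\,d\nu(t) \\
&\overset{(f)}{=} \int_0^\infty \frac{1}{(2t)^{d/2}} \Big[\int_{\mathbb{R}^d} |\hat{\mu}(\omega)|^2\, e^{-\frac{\|\omega\|_2^2}{4t}}\,d\omega\Big]\,d\nu(t) \\
&\overset{(g)}{=} \int_{\mathbb{R}^d} |\hat{\mu}(\omega)|^2 \Big[\int_0^\infty \frac{1}{(2t)^{d/2}}\, e^{-\frac{\|\omega\|_2^2}{4t}}\,d\nu(t)\Big]\,d\omega,
\end{aligned} \tag{27}$$

where Fubini's theorem is invoked in (e) and (g), while (23) is invoked in (f). Since $\mathrm{supp}(\nu) \neq \{0\}$, the inner integral in (27) is positive for every $\omega \in \mathbb{R}^d$ and so $B > 0$. Therefore $k$ is $c_0$-universal by Proposition 8. ■

The above result shows that the notions of $c_0$-universality, $cc$-universality and strict positive definiteness are equivalent for the class of radial kernels on $\mathbb{R}^d$.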

Example 3 The following radial kernels on $\mathbb{R}^d$ have $\mathrm{supp}(\nu) \neq \{0\}$ and therefore are $c_0$-universal, $cc$-universal and strictly pd.

(1) Gaussian, $k(x,y) = e^{-\sigma\|x-y\|_2^2}$, $\sigma > 0$. Note that $\nu = \delta_\sigma$ in (21), where $\delta_\sigma$ represents a Dirac measure at $\sigma$. Clearly $\mathrm{supp}(\nu) = \{\sigma\} \neq \{0\}$.

(2) Inverse multiquadratic, $k(x,y) = (c^2 + \|x-y\|_2^2)^{-\beta}$, $\beta > 0$, $c > 0$, obtained by choosing $d\nu(t) = \frac{1}{\Gamma(\beta)}\, t^{\beta-1} e^{-c^2 t}\,dt$ in (21). It is easy to verify that $\mathrm{supp}(\nu) \neq \{0\}$.

A summary of results for kernels of the type $(A_3)$ is shown in Figure 1(d).
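The scale-mixture representation (21) for the inverse multiquadratic kernel can be confirmed by direct quadrature. The following sketch assumes NumPy, a truncated integration range $[0, 40]$ and the illustrative parameter values $\beta = 2$, $c = 1$, and compares the Gaussian mixture against the closed form $(c^2 + \|x-y\|_2^2)^{-\beta}$ at one squared distance.

```python
import math
import numpy as np

beta, c = 2.0, 1.0                        # illustrative parameters, beta > 0, c > 0
r2 = 1.5 ** 2                             # squared distance ||x - y||^2 (illustrative)

# Density of nu with respect to Lebesgue measure on [0, inf).
t = np.linspace(0.0, 40.0, 400001)
nu_density = t ** (beta - 1) * np.exp(-c ** 2 * t) / math.gamma(beta)

# Scale mixture of Gaussians, equation (21), by the trapezoid rule.
integrand = np.exp(-t * r2) * nu_density
mixture = float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(t)) / 2)

closed_form = (c ** 2 + r2) ** (-beta)    # inverse multiquadratic kernel value
```

Since this $\nu$ has a strictly positive density on $(0, \infty)$, its support is $[0, \infty) \neq \{0\}$, which is the condition required by Proposition 16(b).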
3.5 Kernels of type $(A_4)$

We now consider the characterization of $c$-, $cc$- and $c_0$-universality for $(A_4)$.

Proposition 17 (Kernels of type $(A_4)$) Suppose $(A_4)$ holds.

(a) $k$ is $c$-universal (resp. $cc$-universal) if and only if for any $0 \neq \mu \in M_b(X)$ (resp. $0 \neq \mu \in M_{bc}(X)$), there exists some $j \in I$ for which $\int_X \phi_j\,d\mu \neq 0$.

(b) Let $k(\cdot,x) \in C_0(X)$, $\forall x \in X$. Then $k$ is $c_0$-universal if and only if for any $0 \neq \mu \in M_b(X)$, there exists some $j \in I$ for which $\int_X \phi_j\,d\mu \neq 0$.
Proof We first prove (b). The proof for $c$-universality in (a) is trivial as it follows from (b), while the proof for $cc$-universality in (a) is exactly the same as that of (b) with $M_b(X)$ replaced by $M_{bc}(X)$. Let us consider

$$\iint_X k(x,y)\,d\mu(x)\,d\mu(y) \overset{(c)}{=} \sum_{j \in I} \Big|\int_X \phi_j(x)\,d\mu(x)\Big|^2, \tag{28}$$

where we have invoked Fubini's theorem in (c).

(b) ($\Leftarrow$) Suppose for any $0 \neq \mu \in M_b(X)$, there exists some $j \in I$ for which $\int_X \phi_j\,d\mu \neq 0$. Then, from (28), it is clear that $\iint_X k(x,y)\,d\mu(x)\,d\mu(y) > 0$, $\forall\, 0 \neq \mu \in M_b(X)$ and therefore $k$ is $c_0$-universal, which follows from Proposition 8(c).

($\Rightarrow$) Suppose there exists a non-zero measure, $\mu \in M_b(X)$ for which $\int_X \phi_j\,d\mu = 0$ for any $j \in I$. By (28), this means there exists a $0 \neq \mu \in M_b(X)$ for which $\iint_X k(x,y)\,d\mu(x)\,d\mu(y) = 0$, i.e., $k$ is not $c_0$-universal (by Proposition 8(c)). ■

The conditions in Proposition 17 are not always easy to check. However, for the case of Taylor kernels (Steinwart and Christmann, 2008, Lemma 4.8), which include the exponential kernel, simple, easy to check sufficient conditions can be obtained as shown in Corollary 18. Although this result is exactly the same as Corollary 4.57 in Steinwart and Christmann (2008), we present a different proof (we would like to remind the reader that our characterization of $c$-universality is different from the one provided by Steinwart (2001) and therefore the proof is different; see Remark 7(a)).

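Corollary 18 (Universal Taylor kernels) Let $X := \{x \in \mathbb{R}^d : \|x\|_2 < \sqrt{r}\}$, where $r \in (0, \infty]$. Let $f(t) = \sum_{n=0}^{\infty} a_n t^n$, $t \in (-r, r)$. If $a_n > 0$, $\forall n \geq 0$, then $k(x,y) = f(x^T y)$, $x, y \in X$, is $c$-universal on every compact subset of $X$.

Proof From the proof of Lemma 4.8 in Steinwart and Christmann (2008), we have

$$k(x,y) = f(x^T y) = \sum_{n=0}^{\infty} a_n (x^T y)^n = \sum_{\alpha \in \mathbb{N}^d} a_{|\alpha|}\, c_\alpha\, x^\alpha y^\alpha, \tag{29}$$

where $\alpha := (\alpha_1, \ldots, \alpha_d)$, $|\alpha| := \sum_{j=1}^{d} \alpha_j$, $c_\alpha$ are the multinomial coefficients, $x := (x_1, \ldots, x_d)$ and $x^\alpha := \prod_{j=1}^{d} (x_j)^{\alpha_j}$. From (29), it is clear that $k(x,y) = \sum_{\alpha \in \mathbb{N}^d} \phi_\alpha(x)\phi_\alpha(y)$, $x, y \in X$, where $\phi_\alpha(x) = \sqrt{a_{|\alpha|} c_\alpha}\, x^\alpha$. Let $a_{|\alpha|} > 0$ for all $\alpha \in \mathbb{N}^d$. Then it is clear that for any $0 \neq \mu \in M_b(X)$, there exists $\alpha \in \mathbb{N}^d$ such that $\int_X x^\alpha\,d\mu(x) \neq 0$. Therefore, by Proposition 17, $k$ is $c$-universal. ■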

Examples of kernels that satisfy the conditions in Corollary 18 and therefore are $c$-universal include the exponential kernel, $k(x,y) = \exp(x^T y)$, $x, y \in \mathbb{R}^d$, and the binomial kernel, $k(x,y) = (1 - x^T y)^{-\beta}$, $\beta > 0$, defined on $X \times X$, where $X := \{x \in \mathbb{R}^d : \|x\|_2 < 1\}$. See Examples 4.9 and 4.11 in Steinwart and Christmann (2008).
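The feature expansion behind (28) and (29) can be illustrated for the exponential kernel in dimension $d = 1$, where $a_n = 1/n!$, $c_\alpha = 1$ and $\phi_n(x) = x^n/\sqrt{n!}$. The sketch below assumes NumPy, a discrete signed measure with illustrative support points and weights, and a truncation of the series at 40 terms. It checks that the double integral of $k$ equals the sum of squared feature integrals.

```python
import math
import numpy as np

x = np.array([-0.8, -0.1, 0.5, 1.2])      # support of a discrete signed measure (illustrative)
w = np.array([0.6, -1.0, 0.9, -0.5])      # its weights (illustrative)

# Left side of (28) with k(x, y) = exp(x * y) in dimension d = 1.
lhs = float(w @ np.exp(np.outer(x, x)) @ w)

# Right side of (28): sum over n of (integral of phi_n d mu)^2,
# with phi_n(x) = sqrt(a_n) x^n and Taylor coefficients a_n = 1 / n!.
n_terms = 40                               # truncation level (illustrative)
moments = np.array([np.sum(w * x ** n) / math.sqrt(math.factorial(n))
                    for n in range(n_terms)])
rhs = float(np.sum(moments ** 2))
```

The quadratic form is positive here because at least one moment $\int x^n\,d\mu$ is nonzero, which is exactly the condition used in the proof of Corollary 18.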



To summarize, in this section, by showing the relation between various notions of universality and the injective RKHS embedding of finite signed Radon measures, we have presented a novel measure embedding point of view of universality compared to its well-known function approximation view point. Since the RKHS embedding of finite signed Radon measures generalizes the concept of RKHS embedding of Radon probability measures, the latter being related to characteristic kernels (Fukumizu et al., 2008; Sriperumbudur et al., 2008), in the following section, we relate the notion of universality to characteristic kernels.



4. Characteristic Kernels and Universality 

Recent studies in machine learning have considered the mapping of random variables into a suitable RKHS and showed that this provides a powerful and straightforward method of dealing with higher-order statistics of the variables. Using their RKHS mappings, for sufficiently rich RKHSs, it becomes possible to test for homogeneity (Gretton et al., 2007), independence (Gretton et al., 2008), conditional independence (Fukumizu et al., 2008), to find the most predictive subspace in regression (Fukumizu et al., 2004), etc. Key to the above applications is the notion of a characteristic kernel, defined below, which gives rise to an RKHS that is sufficiently rich in the sense required above.

Definition 19 (Characteristic kernel) Let $X$ be a topological space, $\mathbb{P}$ be a Borel probability measure on $X$ and $k$ be a measurable, bounded kernel on $X$. Then $k$ is said to be characteristic if the embedding,

$$\mathbb{P} \mapsto \int_X k(\cdot,x)\,d\mathbb{P}(x), \tag{30}$$

is injective.



Since the embedding in (30) is a special case of the embedding in (2), and the injectivity of the embedding in (2) is related to universality (see Section 3), we now relate universal and characteristic kernels.



4.1 Main results 

Gretton et al. (2007) have shown that a $c$-universal kernel is characteristic. Besides this result, not much is known or understood about the relation between characteristic and universal kernels. The following result not only provides the same result obtained by Gretton et al. (2007), but also generalizes it for non-compact $X$.



Proposition 20 (Universal and characteristic kernels: I) Suppose the assumptions in Theorem 6 hold. If $k$ is $c$-, $cc$- or $c_0$-universal, then it is characteristic to the set of probability measures contained in $M_b(X)$, $M_{bc}(X)$ or $M_b(X)$, respectively.

Proof The proof is trivial and follows from Theorem 6 and Definition 19. ■

Now, one can ask when the converse to Proposition 20 is true. The following result answers this question for some special classes of kernels.

Proposition 21 (Universal and characteristic kernels: II) The following hold:

(a) Suppose $(A_1)$ holds with $\psi \in C_0(\mathbb{R}^d)$. Then, $k$ is $c_0$-universal if and only if it is characteristic to the set of all Borel probability measures on $\mathbb{R}^d$.

(b) Suppose $(A_2)$ holds. Then, $k$ is $c$-universal if it is characteristic to the set of all Borel probability measures on $\mathbb{T}^d$ and $A_\psi(0) > 0$, where $A_\psi$ is defined in (24).

(c) Suppose $(A_3)$ holds. Then, $k$ is $cc$-universal if and only if it is characteristic to the set of all Borel probability measures on $\mathbb{R}^d$.

Proof (a) Suppose $k$ is $c_0$-universal. Then, by Proposition 20, $k$ is characteristic to $M_+^1(\mathbb{R}^d)$. Conversely, if $k$ is characteristic to $M_+^1(\mathbb{R}^d)$, we have $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, which follows from Theorem 7 in Sriperumbudur et al. (2008). The result therefore follows from Proposition 11(a).

(b) Fukumizu et al. (2009b, Theorem 8) and Sriperumbudur et al. (2009b, Theorem 14) have shown that $k$ is characteristic to $M_+^1(\mathbb{T}^d)$ if and only if $A_\psi(0) \geq 0$, $A_\psi(n) > 0$, $\forall n \in \mathbb{Z}^d \setminus \{0\}$. Therefore, if $k$ is characteristic with $A_\psi(0) > 0$, then it is $c$-universal by Proposition 14.

(c) If $k$ is $cc$-universal, then by Proposition 16, it is $c_0$-universal, and thus characteristic to $M_+^1(\mathbb{R}^d)$ by Proposition 20. To prove the converse, we need to prove that if $k$ is not $cc$-universal, then it is not characteristic to $M_+^1(\mathbb{R}^d)$. If $k$ is not $cc$-universal, then by Proposition 16, we have $\mathrm{supp}(\nu) = \{0\}$ (see (21) for the definition of $\nu$), which means the kernel is a constant function on $\mathbb{R}^d \times \mathbb{R}^d$ and therefore not characteristic to $M_+^1(\mathbb{R}^d)$. ■
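The last step of part (c) can be made concrete with two discrete probability measures on a common finite support. For such measures, the squared RKHS distance between the embeddings in (30) is $(p - q)^T K (p - q)$, where $p$ and $q$ are the probability vectors. The sketch below assumes NumPy and illustrative values of the support and of $p$, $q$. It compares the radial kernel with $\nu = \delta_1$ (a Gaussian) against the one with $\nu = \delta_0$ (the constant kernel).

```python
import numpy as np

pts = np.array([-1.0, 0.0, 0.5, 2.0])     # common support (illustrative)
p = np.array([0.1, 0.4, 0.4, 0.1])        # probability vector of P
q = np.array([0.25, 0.25, 0.25, 0.25])    # probability vector of Q

def embedding_gap(K, p, q):
    """Squared RKHS distance between the embeddings of P and Q: (p - q)^T K (p - q)."""
    d = p - q
    return float(d @ K @ d)

# Radial kernel with nu = delta_1 (Gaussian): supp(nu) = {1} != {0}.
K_gauss = np.exp(-(pts[:, None] - pts[None, :]) ** 2)
# Radial kernel with nu = delta_0: k(x, y) = exp(-0 * ||x - y||^2) = 1 (constant).
K_const = np.ones((len(pts), len(pts)))

gap_gauss = embedding_gap(K_gauss, p, q)
gap_const = embedding_gap(K_const, p, q)
```

The constant kernel maps every probability measure to the same element, so the gap is zero even though $P \neq Q$, while the Gaussian kernel separates the two measures.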

Remark 22 (a) If k is $c_0$-universal, then k is characteristic, which follows from Proposition 20. In general, the converse is not true, which follows from Proposition [TJ and Proposition 21(b). However, on the class of translation invariant kernels and radial kernels defined over $\mathbb{R}^d$, the converse is true, which is shown in Proposition 21(a, c).

(b) Although an RKHS, $\mathcal{H}$, can be characteristic without containing constant functions (Fukumizu et al., 2009b, Lemma 1), Proposition 21(b) shows that if $\mathcal{H}$ does contain constant functions (i.e., $A_\psi(0) > 0$), then the class of characteristic kernels on $\mathbb{T}^d$ is equivalent to the class of c-universal (and, therefore, cc- and $c_0$-universal) kernels. Based on Fukumizu et al. (2009a, Lemma 1) and Carmeli et al. (2009, Theorem 1), this result can be generalized to any LCH space, X: if constant functions are included in $\mathcal{H}$, then characteristic kernels are equivalent to $c_0$-universal kernels.

A summary of the relation between characteristic and universal kernels is shown in Figure 1.

Characteristic kernels vs. Strictly pd kernels: In Section 3, we have shown the relation between universal kernels and strictly pd kernels, while in Propositions 20 and 21, we have related universal and characteristic kernels. We now investigate the relation between characteristic and strictly pd kernels.

Based on Propositions 11, 16 and 21, it is clear that a characteristic kernel that is translation invariant or radial on $\mathbb{R}^d$ is strictly pd. While the converse holds for radial kernels on $\mathbb{R}^d$, it does not hold for translation invariant kernels on $\mathbb{R}^d$, which follows from Proposition 21 and Remark 12(a). Similarly, in the case of translation invariant kernels on $\mathbb{T}^d$, if a kernel is characteristic, then it is strictly pd, which follows from Theorem [TS] and Proposition 21, while the converse is not true. So far, we have presented the relation between characteristic and strictly pd kernels for specific cases of kernels satisfying ($A_1$)-($A_3$), which is summarized in Figure 1. For the general case, it is not clear whether strict positive definiteness is a necessary condition for k to be characteristic. However, the following result shows that conditional strict positive definiteness is a necessary condition for k to be characteristic.

Proposition 23 If k is characteristic, then it is conditionally strictly pd.

Proof Suppose k is not conditionally strictly pd. This means for some $n \ge 2$ and for mutually distinct $x_1, \ldots, x_n \in X$, there exists $\{\alpha_j\}_{j=1}^n \ne 0$ with $\sum_{j=1}^n \alpha_j = 0$ such that $\sum_{l,j=1}^n \alpha_l \alpha_j k(x_l, x_j) = 0$. Define $\mu := \sum_{j=1}^n \alpha_j \delta_{x_j}$, where $\delta_x$ represents the Dirac measure at x. Clearly, $\mu$ is a finite non-zero Borel measure that satisfies (i) $\iint_X k(x,y)\,d\mu(x)\,d\mu(y) = 0$ and (ii) $\mu(X) = 0$. Since $\mu$ is a finite non-zero Borel measure, by the Jordan decomposition theorem (Dudley, 2002, Theorem 5.6.1), there exist unique positive measures $\mu^+$ and $\mu^-$ such that $\mu = \mu^+ - \mu^-$ and $\mu^+ \perp \mu^-$ ($\mu^+$ and $\mu^-$ are singular). By (ii), we have $\mu^+(X) = \mu^-(X) =: \alpha$. Define $P = \alpha^{-1}\mu^+$ and $Q = \alpha^{-1}\mu^-$. Clearly, P and Q are distinct Borel probability measures defined on X. Then, we have

$$\left\| \int_X k(\cdot,x)\,dP(x) - \int_X k(\cdot,x)\,dQ(x) \right\|_{\mathcal{H}}^2 \overset{(a)}{=} \iint_X k(x,y)\,d(P-Q)(x)\,d(P-Q)(y) \overset{(b)}{=} \alpha^{-2} \iint_X k(x,y)\,d\mu(x)\,d\mu(y) = 0,$$

where Lemma 26 is invoked in (a) and (b) is obtained by invoking (i). So, there exist $P \ne Q$ such that $\int_X k(\cdot,x)\,d(P-Q)(x) = 0$, i.e., k is not characteristic. ■
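
As a hedged numerical illustration of the construction used in this proof (it is not part of the original argument), the following sketch takes the linear kernel k(x, y) = xy on the real line, which is not conditionally strictly pd, builds a signed measure from Dirac masses, normalizes its positive and negative parts into P and Q, and checks that the squared RKHS distance between the two embeddings vanishes although P and Q differ. The points, weights and function names are assumptions chosen for this example; the Gaussian kernel is included only as a contrast.

```python
import numpy as np


def linear_kernel(x, y):
    """k(x, y) = x * y on the real line (not conditionally strictly pd)."""
    return x * y


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel, used here only as a contrasting example."""
    return np.exp(-((x - y) ** 2) / (2.0 * sigma**2))


def squared_embedding_distance(kernel, points, p_weights, q_weights):
    """Squared RKHS distance between the embeddings of two discrete measures.

    For P = sum_j p_j delta_{x_j} and Q = sum_j q_j delta_{x_j}, this equals
    sum_{l,j} (p_l - q_l)(p_j - q_j) k(x_l, x_j).
    """
    diff = p_weights - q_weights
    gram = kernel(points[:, None], points[None, :])
    return float(diff @ gram @ diff)


# Hypothetical points and weights: sum(alpha) = 0 and sum(alpha * x) = 0,
# so alpha^T K alpha = (sum_j alpha_j x_j)^2 = 0 for the linear kernel.
points = np.array([-1.0, 0.0, 1.0])
alpha = np.array([1.0, -2.0, 1.0])

# Jordan decomposition of mu = sum_j alpha_j delta_{x_j}.
mu_plus = np.maximum(alpha, 0.0)
mu_minus = np.maximum(-alpha, 0.0)
total_mass = mu_plus.sum()  # equals mu_minus.sum() because mu(X) = 0

p_weights = mu_plus / total_mass   # P = (delta_{-1} + delta_{1}) / 2
q_weights = mu_minus / total_mass  # Q = delta_{0}

linear_gap = squared_embedding_distance(linear_kernel, points, p_weights, q_weights)
gaussian_gap = squared_embedding_distance(gaussian_kernel, points, p_weights, q_weights)
print(linear_gap, gaussian_gap)
```

Under these assumptions the linear kernel cannot separate P from Q, while the Gaussian kernel yields a strictly positive squared distance for the same pair.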

The converse to Proposition 23 is, however, not true.

So far, we presented the relation between characteristic kernels and universal kernels and showed that for any LCH space, X, the characteristic property is a weaker notion than $c_0$-universality. Although such a weaker notion is sufficient to make the embedding in (30) injective, in the following section, we show that the stronger notion of $c_0$-universality is required to study an important property of the "probability metric" associated with the embedding in (30).

4.2 Metrization of weak topology on $M_+^1(X)$

Let X be a Polish space.^6 Based on the embedding, $P \mapsto \int_X k(\cdot,x)\,dP(x)$, $P \in M_+^1(X)$, Gretton et al. (2007) proposed the following pseudometric,

$$\gamma_k(P,Q) = \left\| \int_X k(\cdot,x)\,dP(x) - \int_X k(\cdot,x)\,dQ(x) \right\|_{\mathcal{H}}, \qquad (31)$$

on $M_+^1(X)$, called the maximum mean discrepancy (MMD). Note that when k is characteristic, $\gamma_k$ is a metric on $M_+^1(X)$. One immediate question that naturally arises is "how is MMD related to other metrics on $M_+^1(X)$, such as the Prohorov metric, Dudley metric, Wasserstein-Kantorovich metric, total variation metric, etc.?" This is a question of both theoretical and practical importance.
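
To make the pseudometric in (31) concrete, the following is a minimal sketch of a plug-in estimate in which P and Q are replaced by the empirical measures of two samples. It assumes a Gaussian kernel with a hypothetical bandwidth `sigma`; the function names and the synthetic data are illustrative assumptions rather than anything prescribed by the paper.

```python
import numpy as np


def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix k(a_i, b_j) for the Gaussian kernel; a, b have shape (n, d)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def empirical_mmd(x, y, sigma=1.0):
    """Plug-in estimate of gamma_k(P, Q) from samples x ~ P and y ~ Q.

    With empirical measures in place of P and Q,
    gamma_k^2 = mean k(x, x') + mean k(y, y') - 2 mean k(x, y).
    """
    squared = (
        gaussian_gram(x, x, sigma).mean()
        + gaussian_gram(y, y, sigma).mean()
        - 2.0 * gaussian_gram(x, y, sigma).mean()
    )
    return float(np.sqrt(max(squared, 0.0)))


rng = np.random.default_rng(0)
sample_p = rng.normal(loc=0.0, scale=1.0, size=(200, 1))
sample_q_near = sample_p + 0.1  # small shift of the same sample
sample_q_far = sample_p + 3.0   # large shift of the same sample

print(empirical_mmd(sample_p, sample_p))       # identical samples
print(empirical_mmd(sample_p, sample_q_near))
print(empirical_mmd(sample_p, sample_q_far))
```

In this sketch the estimate is zero for identical samples and grows with the size of the shift, which is the qualitative behaviour one expects from a distance between distributions.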

For example, let us consider the problem of estimating an unknown density based on 
finite random samples drawn i.i.d. from it. The quality of the estimate is measured by 
determining the distance between the estimated density and the true density. Given two 
probability metrics, $\rho_1$ and $\rho_2$, one might want to use the stronger^7 of the two to determine this distance, as the convergence of the estimated density to the true density in the stronger metric implies the convergence in the weaker metric, while the converse is not true. On the other hand, one might need to use a metric of weaker topology (i.e., coarser topology) to show convergence of some estimators, as the convergence might not occur w.r.t. a metric of strong topology. This motivates a deeper analysis of the relation between probability metrics, e.g., as mentioned before, the relation between MMD and other popular probability metrics, in order to determine which metrics are stronger and which are weaker.

6. A topological space $(X, \tau)$ is called a Polish space if the topology $\tau$ has a countable basis and there exists a complete metric defining $\tau$.

7. Two metrics $\rho_1 : Y \times Y \to \mathbb{R}_+$ and $\rho_2 : Y \times Y \to \mathbb{R}_+$ are said to be equivalent if $\rho_1(x,y) = 0 \Leftrightarrow \rho_2(x,y) = 0$, $\forall x, y \in Y$. On the other hand, $\rho_1$ is said to be stronger than $\rho_2$ if $\rho_1(x,y) = 0 \Rightarrow \rho_2(x,y) = 0$, $\forall x, y \in Y$ but not vice-versa. If $\rho_1$ is stronger than $\rho_2$, then we say $\rho_2$ is weaker than $\rho_1$.

Recently, Sriperumbudur et al. (2009b) studied the relation between MMD and other probability metrics such as the Prohorov distance, Dudley metric, Wasserstein distance and total variation distance and showed that MMD is weaker than all these other metrics. This means that the topology induced by MMD is coarser than the topology induced by all these other metrics on $M_+^1(X)$. It is well known that the Prohorov and Dudley metrics induce a topology that coincides with the weak topology (also called the weak-* (weak-star) topology) on $M_+^1(X)$, defined as the weakest topology such that the map $P \mapsto \int_X f\,dP$ is continuous for all $f \in C_b(X)$. This naturally leads to the question, "For what k does the topology induced by MMD coincide with the weak topology?" In other words, "For what k is MMD equivalent to the Prohorov and Dudley metrics?" Although we arrived at this question motivated by an application, this question on its own is theoretically interesting and important in probability theory, especially in proving central limit theorems. Before we answer it (this question was answered for compact Hausdorff X and $X = \mathbb{R}^d$ in Sriperumbudur et al. (2009b, Section 5), whereas in the following, we answer it for general LCH spaces), we need some preliminaries.

The weak topology on $M_+^1(X)$ is the weakest topology such that the map $P \mapsto \int_X f\,dP$ is continuous for all $f \in C_b(X)$. A sequence of measures is said to converge weakly to P, written as $P_n \xrightarrow{w} P$, if and only if $\int_X f\,dP_n \to \int_X f\,dP$ for every $f \in C_b(X)$. A metric $\gamma$ on $M_+^1(X)$ is said to metrize the weak topology if the topology induced by $\gamma$ coincides with the weak topology, which is defined as follows: if, for $P, P_1, P_2, \ldots \in M_+^1(X)$, $(P_n \xrightarrow{w} P \Leftrightarrow \gamma(P_n, P) \to 0$ as $n \to \infty)$ holds, then the topology induced by $\gamma$ coincides with the weak topology.



Proposition 24 Let X be an LCH space and k be $c_0$-universal. Then, the topology induced by $\gamma_k$ coincides with the weak topology on $M_+^1(X)$.

Proof We need to show that for measures $P, P_1, P_2, \ldots \in M_+^1(X)$, $P_n \xrightarrow{w} P$ if and only if $\gamma_k(P_n, P) \to 0$ as $n \to \infty$. To prove the result, we use an equivalent representation of $\gamma_k$ given by Sriperumbudur et al. (2008, Theorem 3),

$$\gamma_k(P,Q) = \sup_{\|f\|_{\mathcal{H}} \le 1} \left| \int_X f\,dP - \int_X f\,dQ \right| = \sup_{f \in \mathcal{H}} \frac{\left| \int_X f\,dP - \int_X f\,dQ \right|}{\|f\|_{\mathcal{H}}}. \qquad (32)$$

($\Leftarrow$) Define $Pf := \int_X f\,dP$. Since k is $c_0$-universal, $\mathcal{H}$ is dense in $C_0(X)$ w.r.t. $\|\cdot\|_u$, i.e., for any $f \in C_0(X)$ and every $\epsilon > 0$, there exists a $g \in \mathcal{H}$ such that $\|f - g\|_u \le \epsilon$. Therefore,

$$|P_n f - Pf| = |P_n(f-g) + P(g-f) + (P_n g - Pg)| \le P_n|f-g| + P|f-g| + |P_n g - Pg| \le 2\epsilon + |P_n g - Pg| \le 2\epsilon + \|g\|_{\mathcal{H}}\,\gamma_k(P_n, P). \qquad (33)$$

Since $\gamma_k(P_n, P) \to 0$ as $n \to \infty$ and $\epsilon$ is arbitrary, $|P_n f - Pf| \to 0$ for any $f \in C_0(X)$. The result follows from Berg et al. (1984, Corollary 4.3), which says that if $P_n f \to Pf$, $\forall f \in C_0(X)$, then $P_n f \to Pf$, $\forall f \in C_b(X)$, i.e., $P_n \xrightarrow{w} P$.

($\Rightarrow$) Suppose $P_n \xrightarrow{w} P$, i.e., $P_n f \to Pf$, $\forall f \in C_b(X)$. This implies $P_n f \to Pf$, $\forall f \in \mathcal{H}$ and therefore $\gamma_k(P_n, P) \to 0$ as $n \to \infty$. ■
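
A small worked example may help to see what metrizing the weak topology means in practice. The sketch below assumes the Gaussian kernel on the real line as a standard example of a $c_0$-universal kernel and considers the Dirac measures at 1/n, which converge weakly to the Dirac measure at 0. For Dirac measures, $\gamma_k$ has a closed form, so the code only evaluates that formula; the function names are assumptions made for this illustration. The total variation distance is shown as a contrast, since it does not detect this convergence.

```python
import math


def gaussian_mmd_between_diracs(x, y, sigma=1.0):
    """gamma_k(delta_x, delta_y) for the Gaussian kernel on the real line.

    gamma_k^2 = k(x, x) + k(y, y) - 2 k(x, y)
              = 2 - 2 exp(-(x - y)^2 / (2 sigma^2)).
    """
    squared = 2.0 - 2.0 * math.exp(-((x - y) ** 2) / (2.0 * sigma**2))
    return math.sqrt(max(squared, 0.0))


def total_variation_between_diracs(x, y):
    """sup_A |delta_x(A) - delta_y(A)|, which is 1 whenever x != y."""
    return 0.0 if x == y else 1.0


for n in (1, 10, 100, 1000):
    x_n = 1.0 / n  # delta_{1/n} converges weakly to delta_0
    print(
        n,
        gaussian_mmd_between_diracs(x_n, 0.0),
        total_variation_between_diracs(x_n, 0.0),
    )
```

In this example the MMD values shrink towards zero as n grows, while the total variation distance stays at 1, which is consistent with MMD inducing a coarser topology than total variation.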

Proposition 24 shows that if k is $c_0$-universal, then MMD induces the same topology as induced by the Prohorov and Dudley metrics and therefore is equivalent to both these metrics. This means that, although k being characteristic is sufficient to guarantee $\gamma_k$ being a metric, a stronger condition on k, i.e., k being $c_0$-universal, is required for $\gamma_k$ to metrize the weak topology on $M_+^1(X)$.



The following result, from Sriperumbudur et al. (2009b, Theorem 23), addresses the metrization of the weak topology by $\gamma_k$ only for compact Hausdorff X, and it can be obtained as a simple corollary to Proposition 24. The general non-compact case was left as an open problem, which we addressed in Proposition 24.



Corollary 25 (Sriperumbudur et al. (2009b)) Suppose X is compact Hausdorff and k is c-universal. Then, $\gamma_k$ metrizes the weak topology on $M_+^1(X)$.

Proof When X is compact, c-universality and $c_0$-universality are equivalent (see Remark [T^c)). Therefore, the result follows from Proposition 24. ■

To summarize, in this section, we have related the notions of universality and characteristic kernels by exploiting the relation between universality and the RKHS embedding of Radon measures, which is discussed in Section 3. We showed that universal and characteristic kernels are equivalent on the class of translation invariant and radial kernels on $\mathbb{R}^d$. In addition, one of the open questions in Sriperumbudur et al. (2009b, Section 5) is addressed by determining the conditions on k so that $\gamma_k$ metrizes the weak topology on the space of probability measures, defined on a general non-compact X.

5. Conclusions & Discussion 

In this work, we have considered the problem of embedding finite signed Borel measures into an RKHS, which is a generalization of the recently studied concept of embedding Borel probability measures into an RKHS, and studied the conditions on the kernel under which this embedding is injective. We showed that the injectivity of this embedding is related to the notion of universality: the embedding is injective if and only if the kernel is universal. In other words, compared to earlier characterizations of universality (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2009), we have provided a novel characterization for universal kernels, which is based on the measure embedding viewpoint as opposed to the point of view of function approximation. In addition, because of this relation between universality and the injective embedding of finite signed Borel measures, we established the relation between universal and characteristic kernels, the latter being related to the injective embedding of Borel probability measures into an RKHS. As an example, we showed the universal and characteristic property to be equivalent in the case of translation invariant and radial kernels on $\mathbb{R}^d$.

The discussion in this paper has been related to the characterization of various notions of universality wherein the RKHS, $\mathcal{H}$, is dense in some subset of C(X) (the space of real-valued continuous functions on X) w.r.t. the uniform norm (here, X is some arbitrary topological space). This means any target function, $f^*$, in the appropriate subset of C(X) can be approximated arbitrarily well by some $g \in \mathcal{H}$ w.r.t. the uniform norm. There is a notion of universality, which we have not considered, called $L_p$-universality (Steinwart and Christmann, 2008, Chapter 5): a measurable and bounded kernel, k, defined on a Hausdorff space, X, is said to be $L_p$-universal if the RKHS, $\mathcal{H}$, induced by k is dense in $L^p(X,\mu)$ w.r.t. the p-norm, defined as $\|f\|_p := \left(\int_X |f(x)|^p\,d\mu(x)\right)^{1/p}$, for all $\mu \in M_+^1(X)$ and some $p \in [1,\infty)$. Here $L^p(X,\mu)$ is the Banach space of p-integrable $\mu$-measurable functions on X. This notion of universality is more applicable in learning theory, where the target function, $f^*$, is usually assumed to lie in $L^p(X,\mu)$ for some $p \in [1,\infty)$ and for some Borel probability measure, $\mu$. By considering this notion of universality, any $f^* \in L^p(X,\mu)$ can be approximated arbitrarily well by some $g \in \mathcal{H}$ w.r.t. the p-norm for all Borel probability measures $\mu$ and some $p \in [1,\infty)$. In particular, Steinwart and Christmann (2008, Theorems 5.31, 5.36 and Corollary 5.37) have shown that $L_p$-universality is necessary and sufficient to achieve consistency in kernel-based learning algorithms. In this paper, we did not consider this notion of universality because, unlike the other notions of universality, it is not straightforward to relate $L_p$-universality and the RKHS embedding of measures by using the Hahn-Banach theorem (see Theorem 5). However, recently, Carmeli et al. (2009, Theorem 1) have shown that k is $L_p$-universal if and only if it is $c_0$-universal, which therefore establishes the relation between $L_p$-universality and the RKHS embedding of measures. Using this result, $L_p$-universality can be related to all other notions considered in this paper, through Figure 1.

Appendix A. Supplementary Results

For completeness, we present the following supplementary result, which is a simple generalization of the technique used in the proof of Sriperumbudur et al. (2008, Theorem 3).



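Lemma 26 Let k be a measurable and bounded kernel on a measurable space, X, and let $\mathcal{H}$ be its associated RKHS. Then, for any $f \in \mathcal{H}$ and for any finite signed Borel measure, $\mu$,

$$\int_X f(x)\,d\mu(x) = \int_X \langle f, k(\cdot,x) \rangle_{\mathcal{H}}\,d\mu(x) = \left\langle f, \int_X k(\cdot,x)\,d\mu(x) \right\rangle_{\mathcal{H}}. \qquad (34)$$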
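
The identity (34) can be checked numerically in the simplest setting, where the measure is a finite combination of Dirac masses and f is a finite kernel expansion. The sketch below assumes a Gaussian kernel on the real line and randomly drawn centers, atoms and weights; all of these, together with the function and variable names, are assumptions made for illustration. It also checks the Cauchy-Schwarz type bound $|\int_X f\,d\mu| \le \|f\|_{\mathcal{H}} \sqrt{\sup_x k(x,x)}\,\|\mu\|$, with $\|\mu\|$ the total variation of the discrete measure.

```python
import numpy as np


def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix k(a_i, b_j) of the Gaussian kernel for 1-D point arrays."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * sigma**2))


rng = np.random.default_rng(1)

# f = sum_i c_i k(., z_i): a hypothetical element of the RKHS.
centers = rng.uniform(-2.0, 2.0, size=5)
coeffs = rng.normal(size=5)

# mu = sum_j w_j delta_{x_j}: a finite signed measure with atoms x_j.
atoms = rng.uniform(-2.0, 2.0, size=7)
weights = rng.normal(size=7)

# Left-hand side of (34): integrate f against mu by evaluating f at the atoms.
f_at_atoms = gaussian_gram(atoms, centers) @ coeffs
integral = float(weights @ f_at_atoms)

# Right-hand side of (34): <f, sum_j w_j k(., x_j)>_H, expanded with the
# reproducing property <k(., z_i), k(., x_j)>_H = k(z_i, x_j).
inner_product = float(coeffs @ gaussian_gram(centers, atoms) @ weights)

# Bound: |integral| <= ||f||_H * sqrt(sup_x k(x, x)) * ||mu||, where
# sup_x k(x, x) = 1 for the Gaussian kernel and ||mu|| = sum_j |w_j|.
f_norm = float(np.sqrt(coeffs @ gaussian_gram(centers, centers) @ coeffs))
bound = f_norm * 1.0 * float(np.abs(weights).sum())

print(integral, inner_product, bound)
```

For discrete measures the two sides of (34) are the same bilinear form written in two orders, so the check is a sanity test of the notation rather than evidence for the general statement.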

Proof Let $T_\mu : \mathcal{H} \to \mathbb{R}$ be a linear functional defined as $T_\mu[f] := \int_X f(x)\,d\mu(x)$. It is easy to show that

$$\|T_\mu\| := \sup_{f \in \mathcal{H}} \frac{|T_\mu[f]|}{\|f\|_{\mathcal{H}}} \le \sqrt{\sup_{x \in X} k(x,x)}\,\|\mu\| < \infty.$$

Therefore, $T_\mu$ is a bounded linear functional on $\mathcal{H}$. By the Riesz representation theorem (Folland, 1999, Theorem 5.25), there exists a unique $\lambda_\mu \in \mathcal{H}$ such that $T_\mu[f] = \langle f, \lambda_\mu \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$. Set $f = k(\cdot,u)$ for some $u \in X$, which implies $\lambda_\mu = \int_X k(\cdot,x)\,d\mu(x)$ and the result follows. ■